
6 - Engineering Implementations in Deep Learning Recommender Systems

Published online by Cambridge University Press:  08 May 2025

Summary

While previous chapters discussed deep learning recommender systems from a theoretical and algorithmic perspective, this chapter shifts focus to the engineering platform that supports their implementation. Recommender systems are divided into two key components: data and model. The data aspect involves the engineering of the data pipeline, while the model aspect is split between offline training and online serving. This chapter is structured into three parts: (1) the data pipeline framework and big data platform technologies; (2) popular platforms for offline training of recommendation models like Spark MLlib, TensorFlow, and PyTorch; and (3) online deployment and serving of deep learning recommendation models. Additionally, the chapter covers the trade-offs between engineering execution and theoretical considerations, offering insights into how algorithm engineers can balance these aspects in practice.

Information

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2025

6 Engineering Implementations in Deep Learning Recommender Systems

In previous chapters, we introduced the key technical points of deep learning recommender systems from multiple perspectives, mainly theoretical and algorithmic. However, algorithms and models are only the “good wine”; they still must be served in a suitable “container” to bring out their best taste. The “container” here refers to the engineering platform on which the recommender system is implemented.

From an engineering perspective, recommender systems can be divided into two parts – data and model. The data part mainly covers the engineering implementation of the data pipeline needed by recommender systems, while the model part refers to the development of the recommendation model, which can be further divided into offline training and online serving based on the stage of model application. Following the overall engineering architecture of recommender systems, this chapter is presented in three parts:

  1. (1) Data pipeline of recommender systems. We will introduce the main framework of the big data platform associated with the data pipeline of a recommender system, as well as the mainstream technologies for implementing the big data platform.

  2. (2) Offline training of deep learning recommendation models. We will introduce the popular platforms for training deep learning recommendation models, such as Spark MLlib, Parameter Server, TensorFlow, and PyTorch.

  3. (3) Online deployment of deep learning recommendation models. We will cover the technical approaches to deploying deep learning recommendation models and the process of model online serving.

In addition to the engineering frameworks, we will also discuss the trade-off between engineering implementation and theory, and share some of our thoughts on how algorithm engineers should balance practice and theory.

6.1 Data Pipeline in Recommender Systems

In this section, we will walk through the data processing pipeline for training and serving recommendation models. Since 2003, when Google began successively publishing its three foundational papers on Bigtable [Reference Chang1], the Google File System [Reference Ghemawat, Gobioff and Leung2], and MapReduce [Reference Dean and Ghemawat3], recommender systems have also entered the big data era. With terabytes or even petabytes of training data, the data pipeline of a recommender system must be closely integrated with the big data processing and storage infrastructure to complete efficient training and online inference.

The development of big data platforms has gone through various stages from batch processing to stream computing, and then to full integration. The continuous development of architectural patterns has brought a substantial improvement in the freshness and flexibility of data processing. Following the order of development, the big data platform mainly includes four architectural modes: batch processing, stream computing, Lambda, and Kappa.

6.1.1 Batch Processing

Before the birth of the big data platform, it was difficult for traditional databases to deal with the storage and processing of massive data. In response to this problem, distributed storage systems represented by Google GFS and Apache HDFS were developed, which solved the problem of massive data storage. In order to further solve the problem of data processing, the MapReduce framework was introduced, in which data are distributed and computed across machines in “Map” steps followed by a “Reduce” operation to achieve massive data manipulation in parallel. The “distributed storage + MapReduce” architecture can only process static data that has already been placed on disk in batches, but cannot process data during data collection and transfer. That is also where the name “batch processing” comes from.

Compared with the classic data processing flow on a traditional database, the batch processing architecture replaces the original data storage and computing method, which relied on traditional file systems and databases, with a distributed file system and MapReduce. The schematic diagram of the batch processing architecture is shown in Figure 6.1.

Figure 6.1 Schematic diagram of batch processing.

However, this architecture can only process the data loaded in a distributed file system, so there will be significant data delay, which can cause a significant impact on the real-time performance of some applications. As a result, the stream processing solution came into being.

6.1.2 Big Data Stream Processing

The big data stream processing architecture consumes and processes the data stream during data generation and transfer (as shown in Figure 6.2). The concept of the sliding window is very important in stream processing. Within each “window,” data is temporarily cached and then consumed. After completing the data processing of one window, the stream computing platform slides to the next time window for a new round of processing. Therefore, in theory, the delay of data in stream processing is determined only by the size of the sliding window. In practical applications, the sliding window is usually on the order of minutes, which reduces the data delay from the several hours of the batch processing approach to minutes.

Figure 6.2 Schematic diagram of stream processing.
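To make the sliding-window idea concrete, the following minimal PySpark Structured Streaming sketch counts events per item within five-minute windows that slide every minute. The built-in rate source and the item_id derivation are illustrative assumptions, not part of the text.

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("stream-window-demo").getOrCreate()

# The "rate" source emits (timestamp, value) rows; treat value % 100 as a toy item id.
events = (spark.readStream.format("rate").option("rowsPerSecond", 100).load()
          .withColumn("item_id", col("value") % 100))

# Aggregate within 5-minute windows that slide every 1 minute.
counts = (events
          .groupBy(window(col("timestamp"), "5 minutes", "1 minute"), col("item_id"))
          .count())

query = (counts.writeStream
         .outputMode("update")   # emit updated window counts as they change
         .format("console")
         .start())
query.awaitTermination()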

Several open-source stream processing frameworks are popular in the market, including Storm, Spark Streaming, and Flink. Flink, which emerged in recent years, treats all data as streams and regards batch processing as a special case of stream processing, so it can be viewed as a native stream processing framework.

The stream processing framework can not only process a single data stream, but also perform join operations on multiple different data streams to combine the data within the same time window. In addition, the output of a stream process can also become the input of downstream applications, and the entire stream computing architecture is flexible and reconfigurable. Therefore, the advantages of stream computing big data architecture are obvious, that is, small data delay and flexibility in data pipeline configuration. This is very helpful for data monitoring, real-time updates of recommender systems’ features, and online training of recommendation models.

However, the stream processing architecture has drawbacks in some application use cases, such as data validation, data replay, and full data analysis. Especially when the sliding window is very small, log disordering and data loss induced by join operations will accumulate data errors. So a data pipeline built solely on stream processing is not ideal for all cases, which calls for a new big data architecture that integrates stream processing and batch processing to a certain extent and takes advantage of each other’s strengths.

6.1.3 Lambda Architecture

Lambda architecture is an important architecture in the field of big data. The data platforms of most first-tier internet companies are basically built based on the Lambda architecture or its subsequent variants.

The data pipeline of the Lambda architecture splits the data collection stage into two branches – stream computing and offline processing. The stream computing branch maintains the stream processing architecture to ensure the freshness of data, while the offline processing part is mainly based on batch processing, which ensures the eventual consistency of data and provides the system with more data processing options. Figure 6.3 shows the schematic diagram of the Lambda architecture.

Figure 6.3 Schematic diagram of Lambda architecture.

To maintain the freshness of data, stream computing is mainly based on incremental computing, while the offline processing part performs full operations on the data to ensure its eventual consistency and the diversity of the final recommender system features. Before storing the statistical data in the final database, the Lambda architecture often merges the streaming data and the offline data, using the offline data to check and correct the streaming data, which is an important step in the Lambda architecture.

The Lambda architecture possesses both real-time processing and batch processing capabilities, and it is currently adopted by many companies. However, because there is a large amount of redundant logic between the streaming and offline processing parts, much code is duplicated and many computing resources are wasted. Is it possible to further integrate the streaming and offline pipelines?

6.1.4 Kappa Architecture

The Kappa architecture was created to solve the code redundancy problem of the Lambda architecture. It adheres to the principle of “everything is streaming”: all data processing is treated as stream processing, whether it is real-time streaming or offline processing, and offline batch processing is just a special case of “stream processing.” In a sense, the Kappa architecture can also be seen as an “upgraded” version of the stream processing architecture.

So specifically, how does the Kappa architecture handle batch processing through the same stream processing framework? In fact, batch processing also has the concept of a time window. But compared with stream processing, this time window is relatively large. The time window of stream processing may be five minutes, while batch processing may use one day. Considering this, batch processing can fully share the computing logic of stream processing.

Since the time window of batch processing is too long, it is impossible to implement it by stream processing in an online environment. Then, the problem becomes how to use the same stream processing framework to perform data batch processing in an offline environment.

In order to solve this problem, two new components, raw data storage and data replay, need to be added to the stream processing framework. Raw data storage saves the original data or logs in the distributed file system, and data replay components replay these original data in chronological order and process them with the same stream processing framework to achieve batch processing offline. This is the main idea of the Kappa architecture (as shown in Figure 6.4).

Figure 6.4 Schematic diagram of Kappa architecture.
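As a minimal illustration of the replay idea (the data format, field names, and file layout here are hypothetical), the same processing function can be applied both to a live event stream and to raw logs replayed from storage in timestamp order:

import json

def build_features(events):
    """Shared processing logic, consumed by both the live stream and the replay path."""
    counts = {}
    for e in events:
        counts[e["item_id"]] = counts.get(e["item_id"], 0) + 1
    return counts

def replay_from_storage(log_path):
    """Replay raw logs saved in the file system in chronological order."""
    with open(log_path) as f:
        events = [json.loads(line) for line in f]
    events.sort(key=lambda e: e["timestamp"])   # chronological replay
    return build_features(events)               # the batch job reuses the streaming logic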

The Kappa architecture fundamentally unifies the stream processing and offline processing that remain separate in the Lambda architecture. It is a very elegant and concise big data architecture. However, in the process of engineering implementation, there are still some difficulties with the Kappa architecture, such as the efficiency of data replay and the question of whether batch processing and stream processing operations can be fully shared. Therefore, the Lambda architecture still dominates the industry, although there has been a gradual trend of migration toward the Kappa architecture.

6.1.5 Integration of Big Data Platforms and Recommender Systems

The relationship between big data platforms and recommender systems is very close. In Section 5.3, we have introduced the importance of freshness in recommender systems in detail. Both the freshness of model features and the model itself heavily depend on the big data processing speed. More specifically, the integration of big data platforms and recommender systems is mainly reflected in two aspects:

  1. (1) Training data processing.

  2. (2) Feature precomputation.

As shown in Figure 6.5, no matter which big data architecture is adopted, the main task of the big data pipeline in the recommender system is to process features and training samples. According to different business use cases, after the feature processing is completed, the training sample and feature data eventually flow in two directions:

  1. (1) The offline big data storage is represented by HDFS. It is mainly responsible for storing samples for offline training.

  2. (2) The online feature store is represented by Redis. It is mainly responsible for providing real-time features for online inferencing.

Figure 6.5 Integration of big data platform and recommender systems.

Apache Hadoop, Hadoop, the Apache Hadoop Logo, Apache Flink, Flink, the Apache Flink Logo, Apache Spark, Spark, the Apache Spark Logo and Apache are either registered trademarks or trademarks of the Apache Software Foundation. Redis is a registered trademark of Redis Ltd. Any rights therein are reserved to Redis Ltd. Any use by Cambridge is for referential purposes only and does not indicate any sponsorship, endorsement or affiliation between Redis and Cambridge.
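As a minimal sketch of the online feature store direction described above (the key names and feature fields are illustrative assumptions), an upstream job can write precomputed user features into Redis, and the online service can fetch them with a single low-latency lookup using the redis-py client:

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Offline or streaming job: write precomputed user features keyed by user id.
r.hset("user_features:10001", mapping={
    "recent_click_count": 17,
    "avg_watch_time": 36.5,
})

# Online inference: fetch the features for real-time scoring.
features = r.hgetall("user_features:10001")
print(features)  # {'recent_click_count': '17', 'avg_watch_time': '36.5'}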

The choice of big data architecture is closely related to how the recommendation model is trained. If the recommendation model wants to perform near-real-time or even real-time training updates, the requirements of the big data pipeline’s processing capabilities will be very high. It requires a stream processing framework to perform feature engineering calculations in real-time, joining operations on multiple data streams, and integrating the model updates in the data pipeline.

The architecture with the machine learning layer included is also known as unified big data architecture. The machine learning layer is added to the stream processing layer in a Lambda or Kappa architecture, which deeply integrates machine learning training and data processing.

In short, the relationship between recommender systems and big data platforms is inseparable in internet applications that generate massive data. All the cutting-edge recommender systems in the industry need deep integration with big data technologies to complete the entire process of training, updating, and online serving.

6.2 Spark MLlib for Offline Recommendation Model Training

We would like to make an analogy between recommender systems and cooking. Whether a chef can make a good dish depends on three key points:

  1. (1) Qualities of the cooking ingredients, like whether they are sufficient and fresh.

  2. (2) Cooking skills of the chef.

  3. (3) On-site performance while cooking, that is, whether the chef can make the best use of the ingredients and how well the chef performs during cooking.

Correspondingly, the data collected in recommender systems are the “cooking ingredients,” and the richness and freshness of the data are equivalent to the “sufficiency” and “freshness” of these ingredients. The offline training of the recommendation model corresponds to the chef’s training process: the more the chef practices and the more kinds of ingredients he has tried, the better his cooking skills become. The online serving of the recommender system can be analogized to the chef “presenting his cooking skills” on-site. A good dish cooked on-site not only requires all the ingredients to be of consistently high quality, but also requires a high standard during the cooking process and suitable adjustments based on the customer’s taste.

Next, we will introduce how the recommender system trains its “cooking skills” in an offline environment, and how to keep “high performance” on-site, so that the online service can provide real-time recommendations that best suit the user’s “taste.”

In internet applications such as recommendation, advertising, and search, the massive data volume at the terabyte or even petabyte level makes it almost impossible to complete model training in a single-machine environment. Distributed machine learning training provides a solution. With respect to offline model training, we will introduce three mainstream solutions for distributed machine learning training – Spark MLlib, Parameter Server, and TensorFlow. They are not the only frameworks to choose from, but they represent three main approaches to distributed training. In this section, we will start with Spark MLlib, the most popular big data framework, and describe how it handles the problem of parallel training.

Although challenged by rising stars such as Flink, Spark is still the most popular computing framework in the industry. Many companies choose Spark’s native machine learning framework, MLlib, for model training to maintain consistency with the technology stack adopted in the data and model pipeline. Spark MLlib became the first choice for distributed training in machine learning, not only because Spark is widely adopted, but also because Spark MLlib’s parallel training approach represents a naive and intuitive solution.

6.2.1 Distributed Computing Mechanism in Spark

Before introducing the distributed machine learning method in Spark MLlib, let’s first review Spark’s distributed computing mechanisms.

Spark is a distributed computing platform. Here, distributed computing means that the computing nodes do not share memory and need to exchange data through network communication. It should be noted that the most typical deployment of Spark is built on a large number of inexpensive computing nodes, which can be cheap hosts or virtual Docker containers. This is different from a CPU+GPU architecture, as well as from high-performance server architectures with shared massive memory and multiple processors. Knowing this is important for understanding the following computing mechanisms of Spark.

As illustrated in the Spark architecture diagram (Figure 6.6), the Spark program is scheduled and organized by the cluster manager (a cluster management node). Specific computing tasks are conducted on worker nodes, and results are returned to the driver program. Data may also be divided into different partitions on a physical worker node. It can be said that a partition is the basic processing unit of Spark.

Figure 6.6 Spark architecture diagram.

When executing a specific program, Spark will disassemble it into a task DAG (directed acyclic graph), and then determine the execution method of each step of the program according to the DAG. Figure 6.7 shows the DAG of a sample Spark job. The job reads files from textFile and hadoopFile, respectively, joins after a series of operations, and finally obtains the processing result.

Figure 6.7 A DAG example of one Spark job.

When executing the DAG shown in Figure 6.7 on the Spark platform, the most critical process is to find which components can be processed in parallel and which components must be shuffled and reduced.

The shuffle here means that the data needs to be redistributed to all partitions before proceeding to the next step. The most typical operations where shuffle can occur are the groupByKey operation and the join operation. Taking the join operation as an example, the textFile data and hadoopFile data must be matched globally to obtain the result data frame (a data structure in Spark) after joining. The groupByKey operation needs to merge all the same keys in the data, thus it also needs a global shuffle to complete.

In contrast, operations such as map and filter only need to process data entries one by one and do not require operations across data entries, so each partition can process its own share of the data, and the data partitions can be processed in parallel.
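To make the distinction concrete, here is a small PySpark sketch (the dataset contents are made up for illustration) contrasting shuffle-free map/filter operations with a join that forces Spark to shuffle data across partitions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("shuffle-demo").getOrCreate()
sc = spark.sparkContext

clicks = sc.parallelize([("u1", "item_a"), ("u2", "item_b"), ("u1", "item_c")])
items  = sc.parallelize([("item_a", "news"), ("item_b", "video"), ("item_c", "music")])

# Narrow transformations: each partition works on its own data, fully in parallel.
normalized = clicks.map(lambda kv: (kv[1], kv[0])).filter(lambda kv: kv[1] != "u2")

# Wide transformation: the join must match keys globally, so Spark inserts a shuffle here.
joined = normalized.join(items)           # (item_id, (user_id, category))
print(joined.collect())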

In addition, before the program renders the final output, it needs to perform a reduce operation to summarize the results from each partition. As the number of partitions gradually decreases, the parallelism of the reduce operation gradually decreases until the final results are aggregated to the master node.

It can be said that the occurrence of shuffle and reduce operations determines the boundaries of purely parallel processing stages. As shown in Figure 6.8, Spark’s DAG is divided into different parallel processing stages.

Figure 6.8 DAG split by shuffle operations.

It should be emphasized that the shuffle operation requires data transfer between different worker nodes, which consumes a lot of computing, communication, and storage resources. Therefore, the shuffle operation should be avoided as much as possible in any Spark job. In summary, the components inside each stage can be computed in parallel and efficiently, and the shuffle or reduce operation that consumes the most resources usually defines the stage boundary.

6.2.2 Parallel Training Mechanism in Spark MLlib

With the foundation of Spark’s distributed computing process, you can more clearly understand the model parallel training mechanism in Spark MLlib.

The model structure determines the degree of parallelism available in the training process. For example, a Random Forest model can be trained in a fully data-parallel manner, while the structural characteristics of GBDT determine that its trees can only be trained sequentially. In this section, we will focus on the implementation of the gradient descent method, because the parallelism of gradient descent directly determines the training speed of deep learning models.

In order to more accurately understand the specific implementation of the Spark parallel gradient descent method, we will dive deep into the source code of Spark MLlib, and directly post the source code of Spark for minibatch gradient descent (the code is taken from the runMiniBatchSGD function of the Spark 2.4.3 GradientDescent class):


while (!converged && i <= numIterations) {
  val bcWeights = data.context.broadcast(weights)
  // Sample a subset (fraction miniBatchFraction) of the total data
  // compute and sum up the subgradients on this subset (this is one map-reduce)
  val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
    .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
      seqOp = (c, v) => {
        // c: (grad, loss, count), v: (label, features)
        val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
        (c._1, c._2 + l, c._3 + 1)
      },
      combOp = (c1, c2) => {
        // c: (grad, loss, count)
        (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
      })

  bcWeights.destroy(blocking = false)

  if (miniBatchSize > 0) {
    /**
     * lossSum is computed using the weights from the previous iteration
     * and regVal is the regularization value computed in the previous iteration as well.
     */
    stochasticLossHistory += lossSum / miniBatchSize + regVal
    val update = updater.compute(
      weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble),
      stepSize, i, regParam)
    weights = update._1
    regVal = update._2

    previousWeights = currentWeights
    currentWeights = Some(weights)
    if (previousWeights != None && currentWeights != None) {
      converged = isConverged(previousWeights.get,
        currentWeights.get, convergenceTol)
    }
  } else {
    logWarning(s"Iteration ($i/$numIterations). The size of sampled batch is zero")
  }
  i += 1
}

This code looks complicated at first glance. But after we extract the key operations, as shown next, the main process of Spark’s gradient descent calculation becomes easy to understand.


while (i <= numIterations) {  // limit the maximum number of iterations
  val bcWeights = data.context.broadcast(weights)  // broadcast all the model weights
  val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
    .treeAggregate(...)  // compute the gradient for each sample, then sum them into gradientSum with treeAggregate
  weights = updater.compute(weights, gradientSum / miniBatchSize)  // update the weights based on the aggregated gradient
  i += 1  // move on to the next iteration
}

This simplified code is quite easy to understand. Basically, Spark’s minibatch process has three steps:

  1. (1) Broadcast the current model parameters to each data partition (which can be used as a virtual computing node).

  2. (2) Each computing node performs data sampling to obtain minibatch data, calculates the gradients separately, and then aggregates the gradients through the treeAggregate operation to obtain the final gradientSum.

  3. (3) Use gradientSum to update model weights.

In this way, the boundaries of the stages in each iteration are very clear. The parallel part inside each stage is the process in which each node samples data and calculates its gradient separately, and the stage boundary is the process of aggregating and summing the gradients from all nodes. Here we highlight the treeAggregate operation, which aggregates the gradients from all the nodes layer by layer based on a tree-like structure. The whole process is a reduce operation and does not include a shuffle operation; since the tree node operations are performed in parallel, the aggregation is very efficient.

After the number of iterations reaches the upper limit or the model has sufficiently converged, the model stops training. This is the whole process of minibatch gradient descent calculation in Spark MLlib, and it is also the most representative implementation of distributed model training in Spark MLlib.
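To make the three steps above concrete, here is a minimal PySpark sketch of the same pattern for a toy linear model with squared loss: broadcast the weights, compute per-partition gradients, aggregate them with treeAggregate, and update on the driver. It mirrors the structure of the MLlib code above but is not the MLlib implementation itself; the data and learning rate are made up.

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("tree-agg-demo").getOrCreate()
sc = spark.sparkContext

w = np.array([0.1, -0.2])                       # current model weights
bc_w = sc.broadcast(w)                          # step (1): broadcast weights to partitions

data = sc.parallelize([(1.0, np.array([1.0, 2.0])),
                       (0.0, np.array([3.0, 4.0]))], numSlices=2)

def seq_op(acc, point):
    grad_sum, count = acc
    label, features = point
    pred = float(features.dot(bc_w.value))
    grad = (pred - label) * features            # per-sample gradient (squared loss)
    return grad_sum + grad, count + 1

def comb_op(a, b):
    return a[0] + b[0], a[1] + b[1]

grad_sum, n = data.treeAggregate((np.zeros(2), 0), seq_op, comb_op)   # step (2)
w = w - 0.01 * grad_sum / n                     # step (3): update on the driver
# In a full training loop, the updated w would be re-broadcast for the next iteration.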

In summary, Spark MLlib’s parallel training is actually achieved through data parallelism. It does not involve a complex gradient update strategy, nor does it implement parallel training through parameter (model) parallelism. This method is simple, intuitive, and easy to implement, but it also has some limitations.

6.2.3 Limitations of Spark MLlib Parallel Training

Although Spark MLlib is based on distributed computing and uses data parallelism to achieve parallel training of gradient descent, the limitations of Spark MLlib parallel training are also very obvious while training a complex neural network. The issues are mainly slow training speed and memory overflow when there are too many model parameters. Specifically, Spark MLlib’s distributed training method has the following drawbacks:

  1. (1) The global broadcast method is adopted to broadcast all model parameters before each iteration. As we all know, Spark’s broadcasting process consumes a lot of resources, especially when the parameter scale is very large. The broadcasting process and maintaining a copy of the weight parameters at each node are extremely resource-intensive, which induces poor performance with complex models.

  2. (2) When the synchronized blocking gradient descent method is used, each round of gradient descent is determined by the slowest node. From the source code of Spark gradient descent, it can be seen that the minibatch process of Spark MLlib aggregates the gradients layer by layer only after all nodes have calculated their respective gradients, and finally generates the global gradient. If a problem such as data skew causes a node to take too long to calculate its gradient, it will block all other nodes from performing the next task. This synchronized blocking distributed gradient calculation is the main reason for the low efficiency of Spark MLlib parallel training.

  3. (3) Spark MLlib does not support complex deep learning network structures or large-scale hyperparameter tuning. Its standard library only supports standard multilayer perceptron (MLP) training and does not support complex network structures such as RNN and LSTM, nor does it allow flexible customization such as choosing different activation functions. All these drawbacks make Spark MLlib less competitive in deep learning applications.

As a result, we need deep learning frameworks with higher training efficiency and support for more flexible network structures. With its efficient distributed training methods, Parameter Server has become a mainstream framework for distributed machine learning, and deep learning platforms such as TensorFlow and PyTorch have become the top choices for distributed machine learning systems due to their flexibility in adjusting network structures and their comprehensive training and serving support. The following sections introduce the main mechanisms of Parameter Server and TensorFlow, respectively.

6.3 Parameter Server for Offline Recommendation Model Training

In Section 6.2, we gave a detailed introduction to the parallel training method in Spark MLlib. Spark adopts a simple and direct data-parallel method to solve the problem of parallel model training. But Spark’s parallel gradient descent uses a synchronized blocking approach, and the model parameters need to be transferred to all nodes through global broadcasting. These factors make Spark’s parallel gradient descent calculation relatively inefficient.

In order to solve this problem, the Parameter Server [Reference Li4,Reference Li5], a distributed and scalable framework, was proposed in 2014. It almost perfectly solves the distributed training problem of machine learning models. Today, Parameter Server is not only directly adopted in the machine learning platforms of some big companies, but also integrated into mainstream deep learning frameworks such as TensorFlow and MXNet, as an important solution for distributed training of machine learning.

6.3.1 Mechanisms of Distributed Training in Parameter Server

First, take a general machine learning problem as an example to explain the mechanism of Parameter Server distributed training.

F(w) = \sum_{i=1}^{n} l(x_i, y_i, w) + \Omega(w)    (6.1)

Equation 6.1 shows a general loss function with a regularization term, where n is the total number of samples, l(x_i, y_i, w) is the loss of a single sample i, x_i is the feature vector of sample i, y_i is its label, and w represents the model parameters. The training objective is to minimize the loss function F(w). To solve argmin F(w), gradient descent is often used. The main function of the Parameter Server is to perform the gradient descent calculation in parallel and update the model parameters until convergence. It should be noted that the regularization term in the formula requires summarizing all model parameters, which makes it difficult to perform fully parallel training. Therefore, Parameter Server adopts the same data-parallel training approach as Spark MLlib to generate local gradients, and then aggregates all the gradients to update the model weights.
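For concreteness, a standard minibatch gradient descent update consistent with Equation 6.1 can be written as follows, where the learning rate \eta and the sampled minibatch B_t at iteration t are notation introduced here for illustration:

w_{t+1} = w_t - \eta \left( \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_w l(x_i, y_i, w_t) + \nabla_w \Omega(w_t) \right)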

The pseudocode of the parallel gradient descent calculation in Parameter Server is shown in Code 6.1:

Code 6.1 Parallel gradient descent process in Parameter Server

It can be seen that the Parameter Server consists of server nodes and worker nodes, and their main functions are as follows:

  • The main function of the server node is to save the model weights, receive the local gradient calculated by the worker node, aggregate and calculate the global gradient, and update the model parameters.

  • The main function of the worker node is to load some training data, pull the latest model parameters from the server node, calculate the local gradient based on the training data, and upload it to the server node.

From the perspective of the framework architecture, the manager–worker structures of Parameter Server and Spark are basically the same, as shown in Figure 6.9.

Figure 6.9 Architecture of Parameter Server.

As illustrated in Figure 6.9, Parameter Server consists of two major components – server group and multiple worker groups. The resource manager is responsible for overall resource allocation and scheduling.

The server node group contains multiple server nodes, each of which is responsible for maintaining some parameters. The server manager in the server node group is responsible for maintaining and allocating server resources.

Each worker node group corresponds to an application (that is, a model training task). There is no communication between worker node groups, nor between the task nodes within the same worker node group. The task nodes only communicate with the servers.

Based on the architecture of Parameter Server, the parallel training process is depicted in Figure 6.10.

Figure 6.10 Schematic diagram of the parallel training process in Parameter Server.

Two important operations in the parallel gradient descent process of Parameter Server are push and pull:

  • Push Operation: The worker node uses the local training data to calculate the local gradient and upload it to the server node.

  • Pull Operation: The worker node pulls the latest model parameters from the server node to the local to perform the next round of gradient calculation.

Following the illustration in Figure 6.10, the whole parallel training process in Parameter Server can be summarized as follows:

  1. (a) Each worker node loads a portion of the training data.

  2. (b) Each worker node pulls the latest model parameters from the server node.

  3. (c) Each worker node calculates the gradient based on local training data.

  4. (d) Worker nodes push the gradients to the server node.

  5. (e) The server node collects all the gradients and aggregates to update the model.

  6. (f) Repeat from step (b) until the total iterations reach the upper limit or the model has converged.
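The following simplified Python sketch mirrors steps (a)–(f) above. It is an in-process, single-threaded toy (the real Parameter Server distributes servers and workers across machines and communicates over the network); the data, learning rate, and class names are made up for illustration.

import numpy as np

class Server:
    """Holds the model weights; aggregates pushed gradients and updates the weights."""
    def __init__(self, dim, lr=0.1):
        self.w, self.lr, self._grads = np.zeros(dim), lr, []
    def pull(self):
        return self.w.copy()
    def push(self, grad):
        self._grads.append(grad)
    def update(self):
        self.w -= self.lr * np.mean(self._grads, axis=0)
        self._grads = []

class Worker:
    """Loads a shard of the training data and computes local gradients."""
    def __init__(self, X, y):
        self.X, self.y = X, y
    def gradient(self, w):
        pred = self.X @ w
        return self.X.T @ (pred - self.y) / len(self.y)   # squared-loss gradient

rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(100, 3)), np.array([1.0, -2.0, 0.5])
y = X @ true_w
server = Server(dim=3)
workers = [Worker(X[:50], y[:50]), Worker(X[50:], y[50:])]   # step (a): load data shards

for _ in range(200):
    for wk in workers:
        w_local = server.pull()                 # step (b): pull the latest weights
        server.push(wk.gradient(w_local))       # steps (c)+(d): compute and push gradient
    server.update()                             # step (e): aggregate and update
print(server.w)                                 # approaches true_w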

6.3.2 Trade-Off between Consistency and Parallel Efficiency

When summarizing Spark’s parallel gradient descent mechanism, we mentioned that Spark’s low efficiency is caused by the “synchronized blocking” nature of the calculation.

This parallel gradient descent process requires that (1) the gradients of all nodes are calculated, (2) the master node aggregates the gradients, and (3) the new model parameters are updated before the next round of gradient calculation can begin. This means that the “slowest” node blocks the gradient update process of all other nodes.

On the other hand, the “synchronized blocking” parallel gradient descent is the most “consistent” gradient descent method, because its calculation results are strictly consistent with those of sequential gradient descent on a single machine.

So is there any way to improve the parallel efficiency of gradient descent with some balance of consistency?

Parameter Server replaces the original “synchronized blocking” method with “asynchronized non-blocking” gradient descent. Figure 6.11 shows the process of calculating the gradients in multiple iterations of a worker node. It can be seen that when the node is conducting the 11th iteration (iter 11) calculation, the push and pull process of the 10th iteration has not ended. That is to say, the latest model parameters after iter 10 have not been pulled to the local in iter 11. In iter 11, the node calculates the gradient still using the same model parameters as those used in iter 10. This is the so-called asynchronized non-blocking gradient descent method, and the progress of other nodes in calculating the gradient will not affect the gradient calculation of current node. All nodes are always working in parallel and will not be blocked by other nodes.

Figure 6.11 The process of calculating the gradient in multiple iterations.

Of course, any technical solution is based on certain trade-offs. Although the asynchronized gradient update method greatly speeds up the training speed, it sacrifices the model consistency. That is to say, the results of parallel training are inconsistent with the results of the original single-thread sequential training, and such inconsistency will have a certain impact on the speed of model convergence. So the final choice of synchronized update or asynchronized update depends on the sensitivity of different models to consistency. This is similar to a model hyperparameter selection problem, which requires specific validations case by case.

In addition, between synchronized and asynchronized updates, you can also control the degree of asynchrony by setting parameters such as the max delay. For example, with a max delay of three, if a worker node has calculated three gradients without completing the pull of the latest model parameters from the server node, it must stop and wait for the pull operation to complete. This is a compromise between synchronized and asynchronized calculation.
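As a toy illustration of the max delay idea (the function, the simulated pull latency, and all names are hypothetical and not part of the Parameter Server API), the following sketch blocks the worker once it runs more than max_delay iterations ahead of its last completed pull:

def simulate_bounded_delay(num_iters=8, max_delay=3, pull_latency=5):
    pulled_version = 0                  # iteration of the weights the worker currently holds
    pull_done_at = 1 + pull_latency     # when the first asynchronous pull will finish
    for it in range(1, num_iters + 1):
        if it >= pull_done_at:
            # The asynchronous pull finished in the background.
            pulled_version, pull_done_at = it, it + pull_latency
        elif it - pulled_version > max_delay:
            # Bounded delay exceeded: block until the in-flight pull completes.
            print(f"iter {it}: blocked, waiting for pull")
            pulled_version, pull_done_at = it, it + pull_latency
        print(f"iter {it}: compute gradient with weights from iteration {pulled_version}")

simulate_bounded_delay()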

This section describes the difference between synchronized and asynchronized updates in the parallel gradient descent method. In terms of efficiency, users need to monitor the following two indicators:

  1. (1) How much waiting time can be saved by “asynchronized” updates?

  2. (2) “Asynchronized” updates will reduce the consistency of gradient updates. Will it take the model longer to converge?

In response to these two questions, the original Parameter Server paper provided a comparison of the efficiency of asynchronized and synchronized updates (based on sparse logistic regression model training). Figure 6.12 compares the computing time and waiting time of the synchronized gradient update strategy and the asynchronized gradient update strategy adopted by Parameter Server.

Figure 6.12 Comparison of computing time and waiting time between synchronized and asynchronized strategies.

In Figure 6.12, System-A and System-B are two systems that both update gradients synchronously, while Parameter Server uses an asynchronous update strategy. The figure shows that Parameter Server spends a much larger proportion of its time computing and far less time waiting than the systems with a synchronous update strategy, which proves that the training efficiency of Parameter Server is significantly improved.

Also, the convergence speed of the two strategies is compared in Figure 6.13.

Figure 6.13 Convergence speed of different strategies.

As shown in Figure 6.13, the convergence speed of asynchronized update strategy in Parameter Server is faster than that of System-A and System-B with synchronized update strategy adopted. This proves that the impact of the inconsistency caused by the asynchronous update is not as great as expected.

6.3.3 Coordination and Efficiency in Multiserver Mode

Another cause for the inefficiency of parallel training in Spark MLlib is that each iteration requires the master node to broadcast the model weight parameters to the worker nodes. This leads to two problems:

  1. (1) The master node becomes a bottleneck due to its resource constraints. As a result, the overall efficiency of model parameter transfer is not high.

  2. (2) All model parameters are broadcast synchronously, which makes the overall network load of the system very heavy.

So how does Parameter Server solve the problem of the inefficient single-master mode? As can be seen from the architecture diagram in Figure 6.9, Parameter Server adopts a multiserver architecture in the server node group, and each server is responsible for only a portion of the model parameters. The model parameters are stored as key-value pairs, so each server is responsible for the parameter updates within a certain range of parameter keys.

Then another question comes, how does each server decide which part of the parameter range it is responsible for? If a new server node is added, how can a new node be added while ensuring that the existing parameter range does not change significantly? The answers to these two questions involve the principles of consistent hashing. Figure 6.14 shows the consistent hashing ring composed of server nodes.

Figure 6.14 Consistent hashing ring composed of server nodes.

In the server node group of Parameter Server, the process of applying consistent hashing to manage parameters is roughly as follows:

  1. (1) Map the keys of the model parameters to a hash ring space. For example, if there is a hash function that can map any key to the hash space of 0 ~ 2^32 − 1, just connect bucket 2^32 − 1 with bucket 0, and this space becomes a hash ring space.

  2. (2) According to the number of server nodes n, the hash ring space is divided into n × m ranges, and each server is alternately allocated m hash ranges. The purpose of this is to ensure load balance and avoid uneven server load caused by hotspot hash ranges.

  3. (3) When a new server node is added to the group, it finds its insertion point on the hash ring and becomes responsible for the hash range between the insertion point and the next range boundary. In other words, the original hash segment is split into two parts: the new server node takes over one part, and the original node keeps the other. This does not affect the allocation of the other hash ranges, so there is no large-scale data reshuffling caused by rehashing.

  4. (4) When a server node is deleted, the insertion point associated with it is removed, and the adjacent node is responsible for the hash range of the removed node.

Applying consistent hashing to the server node group of Parameter Server can effectively reduce the bottleneck problem caused by the original single-server mode. With the multiserver mode, when a worker node wants to pull new model parameters, it sends separate range pull requests to different server nodes, and each server node can then send the model parameters that it is responsible for to the worker node in parallel.
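A minimal Python sketch of mapping parameter keys to server nodes on a hash ring is shown below. The hash function, server names, and the number of virtual nodes are illustrative assumptions, not the actual Parameter Server implementation:

import bisect
import hashlib

def h(key: str) -> int:
    """Map any key into the ring space [0, 2**32 - 1]."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    def __init__(self, servers, vnodes=4):
        # Each server gets several points ("virtual nodes") on the ring for load balance.
        self.points = sorted((h(f"{s}#{i}"), s) for s in servers for i in range(vnodes))
        self.keys = [p for p, _ in self.points]

    def server_for(self, param_key: str) -> str:
        idx = bisect.bisect(self.keys, h(param_key)) % len(self.points)
        return self.points[idx][1]

ring = HashRing(["server-0", "server-1", "server-2"])
print(ring.server_for("embedding_layer/weight_42"))   # which server holds this parameter key
# Adding a fourth server only remaps the ranges adjacent to its insertion points:
ring2 = HashRing(["server-0", "server-1", "server-2", "server-3"])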

In addition, the server nodes can also coordinate efficiently while processing gradients: after a worker node calculates its own gradient, it only needs to use the range push operation to send the gradient to the related server nodes. Of course, this process is also related to the model structure and needs to be combined with the implementation of the model itself. In general, Parameter Server provides the ability to pull and push parameter ranges based on consistent hashing, making the implementation of parallel model training more flexible.

6.3.4 Summary of Technical Key Points in Parameter Server

The key points of Parameter Server in distributed machine learning model training are as follows:

  • Replace the synchronized blocking gradient descent strategy with an asynchronized non-blocking distributed gradient descent strategy.

  • Implement a multiserver node architecture to avoid bandwidth bottlenecks and memory bottlenecks caused by a single-server node.

  • Utilize engineering methods such as consistent hashing, parameter range pulling, and parameter range pushing to achieve minimal data transfer. This design avoids global network congestion and bandwidth waste caused by broadcast operations.

Parameter Server is only a framework to manage the parallel training and does not involve specific model implementation. Therefore, Parameter Server is often used as a component of MXNet and TensorFlow. To implement a machine learning model specifically, it is necessary to rely on general and comprehensive machine learning frameworks. Section 6.4 introduces the mechanism of modern machine learning frameworks represented by TensorFlow.

6.4 TensorFlow for Offline Training of Recommendation Models

The application of deep learning is increasingly deepening in various fields, and the development of major deep learning platforms is also advancing by leaps and bounds. Google’s TensorFlow [Reference Abadi6,Reference Abadi7], Amazon’s MXNet, Facebook’s PyTorch, Microsoft’s CNTK, and others are all deep learning frameworks launched by major technology giants. Unlike Parameter Server, which mainly focuses on model parallel training, the aforementioned deep learning frameworks include almost all steps related to deep learning models, such as model building, parallel training, and online serving. This section takes TensorFlow as an example and introduces model training mechanisms in the deep learning framework, especially the technical details of parallel training.

6.4.1 Fundamentals of TensorFlow

The name “TensorFlow” very accurately expresses the fundamental idea of this framework – constructing a DAG based on the deep learning model architecture, and allowing data to flow in it in the form of tensors.

A tensor is a high-dimensional extension of a matrix, and a matrix is a special form of tensor in a two-dimensional space. In deep learning models, most of the data is represented in matrices or even higher-dimensional tensors. It is very appropriate that Google named its deep learning platform “TensorFlow.”

In order to make tensors flow, a task relationship graph consisting of vertices and edges is built for each deep learning model following its structure, where each vertex represents a certain operation, such as a pooling operation or an activation function. Each vertex can receive zero or more input tensors and produce zero or more output tensors. These tensors flow along the directed edges between the vertices until the final output layer.

Figure 6.15 shows a simple TensorFlow task relationship graph, where vector b, matrix W, and vector x are the inputs of the model. The purple vertices MatMul, Add, and ReLU are operation vertices, representing operations such as matrix multiplication, vector addition, and ReLU activation functions, respectively. The model input tensors W, b, and x flow among vertices after being processed by the operation vertices.

Figure 6.15 A simple TensorFlow-directed graph.
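The graph in Figure 6.15 corresponds to the expression ReLU(W·x + b). A minimal TensorFlow 2 sketch of the same computation is shown below; the toy values are chosen here purely for illustration, and tf.function traces the Python function into a computation graph like the one in the figure.

import tensorflow as tf

W = tf.constant([[1.0, -2.0], [3.0, 4.0]])
x = tf.constant([[0.5], [1.0]])
b = tf.constant([[0.1], [-5.0]])

@tf.function            # traces the Python function into a TensorFlow computation graph
def forward(W, x, b):
    return tf.nn.relu(tf.matmul(W, x) + b)   # MatMul -> Add -> ReLU, as in Figure 6.15

print(forward(W, x, b))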

In fact, any complex model can be transformed into the form of a task relationship graph of operations. It is not only conducive to the modularization of operations and the flexibility to define and implement model structures, but also clarifies the dependencies between operations. Additionally, this graph representation also determines which operations can be executed in parallel and which can only be executed in serial, which lays the foundation for maximizing training efficiency in the parallel training platform.

6.4.2 Parallel Training Process in TensorFlow Based on Task Graph

After constructing a task relationship graph composed of “operations,” TensorFlow can perform flexible task scheduling based on the task relationship graph to maximize the usage of parallel computing resources like multiple GPUs or distributed computing nodes. The general principle of scheduling is that the task nodes or subgraphs with dependencies need to be executed serially, and the task nodes or subgraphs without dependencies can be executed in parallel. Specifically, TensorFlow uses a task queue to solve the dependency scheduling problem. Here, we take an official task relationship graph of TensorFlow as an example (as shown in Figure 6.16) to illustrate the specific process.

Figure 6.16 An example of a task relationship graph given by TensorFlow.

As shown in Figure 6.16, the original operation node relationship graph is further processed into a relationship graph composed of operation nodes and task subgraphs. Among them, the subgraph is composed of a set of serial operation nodes. Due to the pure sequential relationship, it can be treated as an indivisible task node in parallel task scheduling.

In the specific parallel task scheduling process, TensorFlow maintains a task queue. When all the pre-order tasks of a task are executed, the current task can be pushed to the end of the task queue. When there is an idle computing node, the computing node pulls a task at the head of the queue from the task queue for execution.

Taking Figure 6.16 as an example, after the input node, Operation 1 and Operation 3 will be pushed to the task queue at the same time. At this moment, if there are two idle GPU computing nodes, Operation 1 and Operation 3 will be popped out, and executed in parallel. After the execution of Operation 1, Subgraph 1 and Subgraph 2 will be pushed to the task queue for subsequent execution. After Subgraph 2 is executed, the pre-order dependencies of Operation 2 are removed, and Operation 2 is pushed to the task queue. The pre-order dependencies of Operation 4 are Subgraph 2 and Operation 3. Operation 4 will be pushed to the task queue only when these two pre-order dependencies are all completed. When the tasks on all computing nodes have been executed and there are no pending tasks in the task queue, the entire training process ends.
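The scheduling rule described above can be sketched as a ready queue over a dependency graph. The following is a generic topological scheduler written for illustration, not TensorFlow’s actual implementation; the dependencies follow Figure 6.16 only loosely.

from collections import deque

# Pre-order dependencies, loosely following Figure 6.16.
deps = {
    "input": [],
    "op1": ["input"], "op3": ["input"],
    "subgraph1": ["op1"], "subgraph2": ["op1"],
    "op2": ["subgraph2"], "op4": ["subgraph2", "op3"],
}
remaining = {t: set(d) for t, d in deps.items()}
ready = deque(t for t, d in remaining.items() if not d)   # no pending pre-order tasks
scheduled = set(ready)

while ready:
    task = ready.popleft()              # an idle computing node takes the queue head
    print("execute", task)
    for t, pre in remaining.items():
        pre.discard(task)               # this task is no longer a pending dependency
        if not pre and t not in scheduled:
            ready.append(t)             # all pre-order tasks finished: push to the queue
            scheduled.add(t)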

The task relationship graph of TensorFlow and the DAG task relationship graph of Spark have similarities in principle. The difference is that the role of Spark DAG is to clarify the sequence of tasks, and the granularity of tasks remains at the transformation operation level such as join, reduce, and so on. Spark’s parallel mechanism is more of parallel execution within tasks. However, TensorFlow’s task relationship graph decomposes tasks into very fine-grained operation levels, and accelerates the training process by executing independent subtasks in parallel.

6.4.3 Single-Machine Training vs. Distributed Training in TensorFlow

TensorFlow’s computing platform has two main modes: one is single-machine training, and the other is multimachine distributed training. In single-machine training, although the execution also includes parallel computation across CPUs and GPUs, it generally takes place in a shared-memory environment, so there is no need to worry much about communication problems. Training in a cluster environment, in contrast, is carried out by independent computing nodes that rely on network communication to transfer data, so it can be considered a computing environment similar to the Parameter Server introduced in Section 6.3.

As shown in Figure 6.17, the single-machine training of TensorFlow is performed on a worker node, and the single worker node performs parallel computing among different GPU+CPU nodes according to the task relationship graph. In a distributed environment, there are multiple worker nodes. If TensorFlow’s Parameter Server strategy (tf.distribute.experimental.ParameterServerStrategy) is used, each worker node will be trained in a data-parallel manner. That is to say, each worker node is trained in the same task relationship graph, but the training data is different, and the generated gradients are aggregated and updated in the form of Parameter Server.

Figure 6.17 Single-machine and distributed training environments for TensorFlow.
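A hedged sketch of enabling the strategy mentioned above in TensorFlow 2 is shown below. It assumes the cluster (chief, worker, and ps tasks) is described through the TF_CONFIG environment variable, and it omits the dataset and training-loop details; it will not run outside a configured cluster.

import tensorflow as tf

# Assumes TF_CONFIG describes the chief, worker, and ps tasks of the cluster.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)

with strategy.scope():
    # Variables created here are sharded across the parameter servers,
    # while the workers run the data-parallel training steps.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    optimizer = tf.keras.optimizers.Adam()

# The coordinator dispatches training steps to the workers; see the official
# TensorFlow distributed training guide for the full training loop.
coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(strategy)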

Here, we will introduce the specific task allocation between the CPU and GPU within each worker node. The GPU has the advantage of many cores, so it has a huge advantage over the CPU when dealing with tensor operations such as matrix addition and vector multiplication. When processing a task node or task subgraph, the CPU is mainly responsible for data and task scheduling, while the GPU is responsible for the computationally intensive tensor operations.

For example, when processing the element multiplication operation of two vectors, the CPU will act as the central scheduler, send the elements of the corresponding ranges of the two vectors to the GPU for processing, then collect the processed results, and finally generate the result vector. From this perspective, the combination of CPU+GPU is also like a “simplified” Parameter Server.

6.4.4 Summary of TensorFlow Technical Key Points

This section presents the operation of the deep learning platform represented by TensorFlow from the perspective of the training mechanism. The technical key points can be summarized as follows:

  1. (1) The principle of TensorFlow is to convert the model training process into a task relationship graph. The training data flows in the task relationship graph in the form of tensors to complete the entire training.

  2. (2) TensorFlow performs task scheduling and parallel computing based on the task relationship graph.

  3. (3) For distributed training in TensorFlow, its training process can be divided into two layers. One layer is the data-parallel training process based on the Parameter Server architecture, and the other layer is the parallel computing process at the CPU+GPU task level inside each worker node.

Learning TensorFlow and its tuning will involve a lot of basic knowledge. The models, operations, and training methods supported by TensorFlow are also very comprehensive. We don’t intend to give a very in-depth introduction. Interested readers can use the contents of this chapter as a starting point, and use the official documents and tutorials provided by TensorFlow as a guide for more systematic learning.

6.5 Online Serving of Deep Learning Recommendation Model

The previous sections introduced the offline training platforms for deep learning recommendation models. Whether it is TensorFlow, PyTorch, or the traditional Spark MLlib, they all provide a relatively mature offline parallel training environment. After all, the recommendation model must be used in the online environment, and how to deploy an offline-trained model to the online production environment for real-time inference has always been a difficult task in the industry. This section will introduce the mainstream methods for deploying recommendation models.

6.5.1 Pre-Stored Recommendation Results or Embeddings

For online serving of the recommendation model, the simplest and most straightforward method is to generate the recommendation results of each user in an offline environment, and then pre-store the results in an online database such as Redis. You can directly extract the pre-stored data in the online environment and recommend it to the user online. The pros and cons of this method are obvious. The pros are as follows:

  1. (1) There is no need to implement the process of model online inference. The offline training platform is completely decoupled from the online service platform, and any offline machine learning tool can be flexibly selected for model training.

  2. (2) There is no complicated calculation in the online service process, so the online latency of the recommender system is extremely low.

The cons of this method are as follows:

  1. (1) It needs to store the recommendation results for the combinations of users, items and application scenarios. When the number of users and items is very large, it can easily encounter combination explosion, and the online database cannot support the storage of such large-scale results.

  2. (2) The online contextual features cannot be introduced, so the performance of the recommender system is limited.

Considering these pros and cons, the method of directly storing recommendation results is usually adopted only for small user scales, or some special application scenarios such as cold start and popular lists.

Pre-computing and storing the embeddings for user and item is another way to replace online inferencing with stored data. Compared with directly storing the recommendation results, the method of storing embeddings greatly reduces the amount of storage. It only needs to conduct the inner product or cosine similarity operation to obtain the final recommendation result online, which is a method that is often used in the industry to deploy the model.
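A minimal sketch of this pattern is shown below (the key names, vector sizes, and JSON serialization are illustrative assumptions): embeddings are written to Redis offline, and online serving only fetches them and ranks candidates by inner product.

import json
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Offline: store trained embeddings (serialized as JSON for simplicity).
r.set("user_emb:10001", json.dumps([0.12, -0.45, 0.33, 0.08]))
r.set("item_emb:item_a", json.dumps([0.25, -0.10, 0.40, -0.02]))
r.set("item_emb:item_b", json.dumps([-0.30, 0.22, 0.05, 0.18]))

# Online: fetch embeddings and rank candidate items by inner product.
user = np.array(json.loads(r.get("user_emb:10001")))
scores = {
    item: float(np.dot(user, np.array(json.loads(r.get(f"item_emb:{item}")))))
    for item in ["item_a", "item_b"]
}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))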

This method cannot support the introduction of online contextual features and cannot perform online inference with a complex model. As a result, the expressiveness of the recommendations is limited. Therefore, complex models still require a recommender system capable of online inference.

6.5.2 Self-Developed Model Online Serving Platform

Whether in the era when deep learning was just emerging a few years ago, or today when TensorFlow and PyTorch have become popular, self-developed machine learning training and online serving platforms are still a popular option for many large and medium-sized companies. Why do these companies not use the flexible and mature frameworks such as TensorFlow, but still develop their own model training and serving platform from scratch?

An important reason is that general-purpose platforms such as TensorFlow need to support a large number of redundant functions for flexibility and versatility, which makes the platform too heavy and difficult to modify and customize. The advantage of the self-developed platform is that it can be customized according to the company’s business and actual needs, as well as taking into account the efficiency of model serving.

The author has participated in implementing Follow the Regularized Leader (FTRL) models, neural network models, and their customized online serving platforms. Since such a platform does not depend on any third-party tools, the online serving process can be designed around the actual production environment. For example, if a Java server handles online traffic, serving an FTRL-trained model amounts to fetching the model parameters from a parameter server or in-memory database and implementing the inference logic in custom Java code.
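To illustrate this serving logic, here is a minimal sketch of scoring a sparse logistic-regression model (such as one trained with FTRL) using parameters fetched from an in-memory store. The text above describes a Java implementation; the sketch uses Python for brevity, and the Redis hash layout and feature names are illustrative assumptions.

import math
import redis

r = redis.Redis(host="localhost", port=6379)

def predict_ctr(active_features):
    # Model weights are assumed to live in a Redis hash "lr_weights": feature name -> weight.
    raw = r.hmget("lr_weights", active_features)
    weight_sum = sum(float(w) for w in raw if w is not None)  # missing features contribute zero
    bias = float(r.get("lr_bias") or 0.0)
    z = bias + weight_sum    # with one-hot features, w . x is just the sum of the active weights
    return 1.0 / (1.0 + math.exp(-z))

# Example: the features that fired for this request.
print(predict_ctr(["user_age_bucket=25_34", "item_category=sports", "hour=21"]))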

Another reason is that general deep learning frameworks cannot easily support models with special requirements, such as certain retrieval models in recommender systems, “exploration and exploitation” models, and cold-start algorithms that are tightly coupled with the specific business use case. The online serving code of such models usually has to be developed in house.

The disadvantages of self-developed platforms are equally obvious. Writing customized code for one or two models is feasible, but implementing, comparing, and tuning dozens of models becomes prohibitively expensive given the cost of hand-coding each one. With new models emerging constantly, the iteration cycle of a fully self-developed stack is simply too long. Therefore, self-developed platforms and models are usually adopted only by large companies, and only after the model structure has been settled, at which point the inference code is implemented by hand.

6.5.3 Pre-Trained Embedding and Lightweight Online Model

Fully self-developed platforms have clear drawbacks, such as heavy engineering effort and poor flexibility, and with the rapid evolution of complex models these drawbacks are becoming even more pronounced. Is there a way to combine the flexibility and rich functionality of a general-purpose platform with the online inference efficiency of a self-developed one? The answer is yes.

Many companies in the industry have adopted a design pattern in which a complex network is trained offline to generate embeddings, the embeddings are stored in an in-memory database, and a lightweight model such as logistic regression or a shallow neural network performs the online recommendation. The “two-tower” model introduced in Section 4.3 is a typical example (as shown in Figure 4.5).

The two-tower model uses complex networks to embed the user features and item features, respectively. Before the final cross layer, there is no interaction between the user features and the item features, which forms two independent “towers.”

After the two-tower model has been trained, the final user and item embeddings can be stored in the in-memory database. During online inference there is no need to reproduce the complex towers; only the logic of the final output layer needs to be implemented. This output layer is usually logistic regression, a softmax, or a shallow neural network, all of which are straightforward to implement online. Once the user embedding and the item embedding are fetched from the in-memory database, the final prediction is obtained by running them through the output layer.

With this architecture, contextual features can also be fed into the final output layer together with the user and item embeddings, which makes it possible to introduce more real-time signals and enrich the model’s feature sources.
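A minimal sketch of this lightweight online stage is given below: fetch the two embeddings, append a few real-time contextual features, and apply a logistic-regression output layer. The dimensions, context features, and weight layout are illustrative assumptions.

import numpy as np

def score(user_emb, item_emb, context, w, b):
    """Lightweight output layer: logistic regression over [user_emb, item_emb, context]."""
    x = np.concatenate([user_emb, item_emb, context])
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

# Illustrative shapes: two 32-dim tower outputs plus 3 contextual features (hour, device, position).
user_emb = np.random.rand(32).astype(np.float32)   # in practice fetched from the in-memory database
item_emb = np.random.rand(32).astype(np.float32)
context = np.array([21 / 24, 1.0, 0.2], dtype=np.float32)
w, b = np.random.rand(67).astype(np.float32), 0.0  # weights exported from offline training

print(score(user_emb, item_emb, context, w, b))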

Now that graph embedding techniques have become very powerful, offline embedding training can integrate a large amount of user and item information, so the output layer does not need to be complicated. Pre-trained embeddings plus a lightweight online model is therefore a flexible and simple serving approach that sacrifices little model performance.

6.5.4 Model Transformation and Deployment with PMML

The embedding-plus-lightweight-model approach is practical and efficient, but it splits the model into pieces and gives up end-to-end training and deployment. Is there a way to deploy a model directly after it is trained offline? This section introduces a platform-agnostic model deployment method – the Predictive Model Markup Language (PMML).

PMML is a general-purpose markup language that expresses model structures and parameters in XML. In online model deployment, PMML often acts as an intermediary between the offline training platform and the online inference platform.

Here, Spark MLlib is used as an example to explain the role of PMML in the entire machine learning model training and online deployment process, as illustrated in Figure 6.18.

Figure 6.18 Utilizing PMML for model deployment in the Spark MLlib framework.

The example in Figure 6.18 uses JPMML as the library for serializing and parsing PMML files. The JPMML library is used in two parts of this design: the Spark platform and the Java server. On the Spark side, it serializes the trained Spark MLlib model into a PMML file and saves it to a database or file system reachable by the online server. On the Java inference server, the JPMML library parses the PMML file and generates a model object that is integrated with the rest of the business logic code.
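As an illustration of the Spark-side step, the sketch below exports a fitted Spark MLlib pipeline to a PMML file using the pyspark2pmml wrapper around JPMML-SparkML. The choice of this wrapper, the column names, and the file paths are assumptions for the example; the design described above could just as well call the Scala/Java JPMML-SparkML API directly.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark2pmml import PMMLBuilder  # assumes the JPMML-SparkML jar is on the Spark classpath

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("training_samples.parquet")   # assumed columns: f1, f2, label

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)

# Serialize the fitted pipeline to PMML so a Java server can load it with the JPMML evaluator.
PMMLBuilder(spark.sparkContext, df, model).buildFile("ctr_model.pmml")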

JPMML only performs inference on the Java server and does not need to handle model training or distributed deployment, so the library is relatively lightweight and can complete the inference process efficiently. Another open-source project that plays a similar role as a medium for model transformation and online serving is MLeap.

In fact, JPMML and MLeap can also convert and deploy simple Scikit-learn and TensorFlow models. However, for complex TensorFlow models the expressivity of PMML is insufficient, so deploying a complex TensorFlow model calls for a more native solution – TensorFlow Serving.

6.5.5 TensorFlow Serving

TensorFlow Serving is the native model server developed by the TensorFlow team. Essentially, its workflow is the same as that of the PMML-based tools; the difference is that TensorFlow defines its own model serialization standard, the SavedModel format. Using the serialization functions that come with TensorFlow, the trained model structure and parameters can be saved to a designated directory.
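For example, a trained Keras model can be exported in the SavedModel format into a versioned directory, which is the on-disk layout TensorFlow Serving expects. The toy model and paths below are illustrative.

import tensorflow as tf

# Illustrative toy model standing in for a real recommendation model.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# TensorFlow Serving expects <model_base_path>/<model_name>/<version>/ on disk.
export_path = "/models/recommender/1"
tf.saved_model.save(model, export_path)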

The most common and convenient way to use TensorFlow Serving is to build the model serving API with Docker. Once the Docker environment is ready, setting up the TensorFlow Serving environment only requires pulling the official image with the following command:


docker pull tensorflow/serving

After starting the Docker container, the model serving API can be launched with a single command:


tensorflow_model_server --port=8500 --rest_api_port=8501 \
--model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME}

Here, MODEL_NAME and MODEL_BASE_PATH only need to be replaced with the actual model name and the path to the exported model files.
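Once the server is up, predictions can be requested over its REST API. The sketch below sends a request from Python; the model name, port mapping, and the shape of the input instances are illustrative and assume the toy model exported earlier.

import requests

# TensorFlow Serving's REST predict endpoint: /v1/models/<model_name>:predict
url = "http://localhost:8501/v1/models/recommender:predict"
payload = {"instances": [[0.1, 0.3, 0.0, 1.0, 0.5, 0.2, 0.0, 0.7, 0.9, 0.4]]}

response = requests.post(url, json=payload, timeout=1.0)
print(response.json())   # expected form: {"predictions": [[...]]}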

Of course, building a complete TensorFlow Serving deployment is not easy, since it involves engineering issues such as model updates, maintenance, and on-demand scaling of the Docker container cluster. TensorFlow Serving’s performance is still criticized in parts of the industry, but its ease of use and support for complex models make it the first choice for serving TensorFlow models.

6.5.6 Flexible Choice of Model Serving Method

Serving a deep learning recommendation model online is essentially an engineering problem, because it is closely tied to a company’s online server environment, hardware, offline training environment, database and storage systems, and so on. For this reason, every company takes a different approach. Although this section has described five main serving methods, it cannot cover every approach used in the industry; even within a single company, different business scenarios may call for different serving methods.

Therefore, a machine learning engineer should not only understand the mainstream model serving methods, but also weigh them against the company’s engineering environment to arrive at the most suitable solution.

6.6 Trade-Off between Engineering and Theory

Engineering and theory are sometimes in conflict, yet interdependent, when solving technical problems. Theory depends on engineering implementation to reach production; otherwise it remains pie in the sky with no practical application. At the same time, engineering constraints can restrict the development of theory. In this section, we discuss how to make trade-offs between engineering and theory in the field of recommender systems.

6.6.1 Nature of the Engineer’s Responsibilities

Balancing engineering and theory is something every engineer needs to consider, and it calls for an engineering mindset rather than the “research mindset” of a scholar. Recommender systems is a field with a strong emphasis on engineering implementation, whose primary goal is putting theory into production, so the importance of an engineering mindset is self-evident. Next, we explain how to make these trade-offs from the perspective of an engineer.

Whether you are a machine learning engineer, a software engineer, or even an engineer designing electric vehicles or rockets, your responsibility is the same: to find and apply the best solution that delivers the product within the given constraints.

In the context of recommender systems, these constraints may come from the R&D schedule, limitations of the hardware and software environment, the requirements of the actual business logic and application scenarios, or the business objectives defined by the product manager. Because of such constraints, an engineer cannot freely try new technologies or carry out as much exploratory work as an academic researcher. Striking a balance between cutting-edge theory and engineering implementation is therefore a basic skill every engineer should have. In the following sections, we use three practical cases to illustrate how to make such technical trade-offs in real projects.

6.6.2 Trade-Off between Redis Capacity and Online Model Deployment

An online recommender system needs both model parameters and online features for real-time inference. To ensure real-time performance with low query latency, many companies host this data in an in-memory database, and Redis has become the mainstream choice. However, Redis consumes large amounts of memory, and memory is scarce and expensive compared with other resources. Whether a company uses AWS (Amazon Web Services), Alibaba Cloud, or a self-built data center, the cost of Redis is relatively high, and Redis capacity has become a key factor constraining how recommendation models go online.

Due to such constraints, engineers must consider the problem from two aspects:

  1. (1) The model’s parameter scale should be kept as small as possible. This is especially important for deep learning recommendation models, whose parameter counts are several orders of magnitude larger than those of traditional models.

  2. (2) The number of features used for online inference cannot grow indefinitely; features should be traded off according to their importance.

To launch a recommender system under such constraints, it is necessary to drop unimportant factors and focus on the key points. An experienced engineer’s reasoning might go as follows:

  1. (1) If the feature dimensionality is in the tens of millions or higher, the number of parameters is theoretically of the same order. It is difficult for an online service to support this data volume, so the sparsity of the model must be improved: focus on the key features and discard secondary ones. Even if this slightly reduces prediction accuracy, it cuts online inference latency and the consumption of engineering resources.

  2. (2) What are the key techniques for increasing the model’s sparsity? We can add an L1 regularization term or adopt a training method that induces strong sparsity, such as FTRL (see the sketch after this list).

  3. (3) There are often several technical routes to the same goal. When it is unclear which is better, a good choice is to implement all of them and compare them with offline and online metrics.

  4. (4) Determine the final technical approach based on the data and improve the engineering implementation.
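As a minimal illustration of point (2), the sketch below shows two common ways to induce sparsity in a wide logistic-regression layer with TensorFlow/Keras: an explicit L1 penalty on the weights and the FTRL optimizer's built-in L1 regularization. The feature dimension and regularization strengths are illustrative assumptions, not tuned recommendations.

import tensorflow as tf

NUM_FEATURES = 1_000_000   # illustrative one-hot feature dimension

model = tf.keras.Sequential([
    tf.keras.Input(shape=(NUM_FEATURES,)),
    tf.keras.layers.Dense(
        1,
        activation="sigmoid",
        kernel_regularizer=tf.keras.regularizers.l1(1e-6),  # explicit L1 penalty on the weights
    ),
])

model.compile(
    optimizer=tf.keras.optimizers.Ftrl(
        learning_rate=0.05,
        l1_regularization_strength=0.01,   # larger values push more weights to exactly zero
    ),
    loss="binary_crossentropy",
)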

This is how simplification is done on the model side. The same idea applies to reducing online features. Principal component analysis can first be used for feature screening, so that the online feature set is cut without significantly hurting model performance; for features that are hard to rank, offline evaluation and online A/B testing can be used until the feature set reaches a level the engineering system can support.
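For the feature-screening step, a simple starting point is to examine how much variance a reduced representation retains. The sketch below uses scikit-learn's PCA; the sample data and the 95% variance threshold are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(10_000, 200)      # offline sample: 10k rows, 200 raw online features (illustrative)

pca = PCA(n_components=0.95)         # keep enough components to explain 95% of the variance
X_reduced = pca.fit_transform(X)

print(f"reduced from {X.shape[1]} to {X_reduced.shape[1]} dimensions")
print("top explained variance ratios:", pca.explained_variance_ratio_[:5])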

6.6.3 Research Development Schedule Constraints and Trade-Offs in Technology Selection

In a real engineering environment, research and development schedule constraints are another factor that cannot be ignored. They test an engineer’s ability to manage the overall project and to estimate the development timeline. In an industry where product iteration keeps accelerating, no one wants to be the slowest link that drags down other teams.

When upgrading a technology stack, new product requirements must be weighed against the progress of the overall migration. Suppose, for example, that a company decides to migrate its machine learning platform from Spark to TensorFlow in order to follow the latest technology trends. Because Spark’s programming language and model training methods differ substantially from TensorFlow’s, the migration requires a long development cycle. If new product requirements arrive during the migration, engineers have to make trade-offs between the technology upgrade and day-to-day development.

There are two possible technical paths:

  1. (1) Let the entire team focus on completing the migration from Spark to TensorFlow, and then conduct research and development of new models and new functions on the new platform.

  2. (2) Have some team members continue development on the mature and stable Spark platform so that product requirements are met quickly, which also leaves sufficient time for the TensorFlow migration; meanwhile, another group works full-time on TensorFlow to bring the new platform to maturity before mass migration.

From a purely technical point of view, once the decision to migrate to TensorFlow has been made, there is theoretically no need to spend time developing new models on Spark. However, two key considerations must be kept in mind:

  1. (1) No matter how mature a platform is, it takes the team a long time to break it in and tune it; it cannot be expected to support important business logic immediately after the migration.

  2. (2) A platform migration is primarily a technical decision and requires transparency with other stakeholders, but it should not become a direct reason for deprioritizing business support.

Therefore, from the perspective of project progress and risk, the second path is the more realistic choice.

6.6.4 Trade-Offs between Hardware Platform and Model Structure

Almost every machine learning engineer has voiced a similar complaint – the company’s hardware resources are so limited that training a model takes nearly a day. Large companies may have relatively more resources, while small companies are more often constrained by their R&D budget, but hardware is finite at any scale. Engineers therefore have to learn to optimize every engineering aspect of the model under limited hardware resources.

The optimization here actually includes two aspects:

  1. (1) The first is optimizing the program itself. Complaints that Spark runs too slowly often stem from a shallow understanding of Spark’s shuffle mechanism: problematic jobs typically perform large amounts of unnecessary shuffling or suffer from data skew. Fixing such issues requires no technical trade-off, only a more solid grasp of the underlying mechanisms.

  2. (2) In other scenarios, optimization does mean technical trade-offs. Can training be sped up, resource consumption reduced, and online performance improved by optimizing or simplifying the model structure? Typical cases are mentioned in Section 5.3. In a deep learning model, the convergence speed is strongly correlated with the number of parameters, and the embedding layer usually accounts for the majority of them. To accelerate training, the embedding components can therefore be pre-trained separately, allowing the remaining components to converge quickly (a sketch follows below). This approach gives up the consistency of end-to-end training, but under tight hardware constraints the benefit of getting a better model online sooner may far outweigh the benefit of end-to-end consistency. Another example is model structure simplification: if added model complexity yields only marginal gains, there is no need to burn hardware resources on them; the effort is better redirected to improving online system performance, mining other useful information, or introducing a more effective network structure.
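The following is a minimal sketch of the pre-trained-embedding idea: an item embedding table trained separately offline is loaded into the model as a frozen layer, so only the upper layers are trained. The vocabulary size, embedding dimension, file name, and tower sizes are illustrative assumptions.

import numpy as np
import tensorflow as tf

NUM_ITEMS, EMB_DIM = 100_000, 32
pretrained_item_emb = np.load("item_embeddings.npy")   # assumed offline artifact, shape (NUM_ITEMS, EMB_DIM)

item_id = tf.keras.Input(shape=(1,), dtype="int32")
other_feats = tf.keras.Input(shape=(20,))              # assumed dense features

item_emb = tf.keras.layers.Embedding(
    NUM_ITEMS, EMB_DIM,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained_item_emb),
    trainable=False,                                   # freeze the embedding: only upper layers train
)(item_id)
item_emb = tf.keras.layers.Flatten()(item_emb)

x = tf.keras.layers.Concatenate()([item_emb, other_feats])
x = tf.keras.layers.Dense(64, activation="relu")(x)
output = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model([item_id, other_feats], output)
model.compile(optimizer="adam", loss="binary_crossentropy")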

6.6.5 Balance between Whole Picture and Details

These cases cannot cover all the engineering trade-offs made in practice. We only hope that they help readers build good engineering intuition, step back from overly specific technical details, and balance the whole picture against the details.

This chapter covers only a small part of the engineering work behind deep learning recommender systems. However, if readers use it as a starting point to build an overall understanding of recommender system engineering architecture and to grasp the principles, advantages, and disadvantages of the various technical approaches, they are on their way to becoming excellent recommendation engineers.
