Using BertServer

Installation

The best way to install the server is via pip. Note that the server can be installed separately from BertClient, or even on a different machine:

pip install bert-serving-server
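
The client can be installed independently via its companion package, e.g. on the machine that will send encoding requests:

pip install bert-serving-client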

Warning

The server MUST be run on Python >= 3.5 with TensorFlow >= 1.10 (one-point-ten). Again, the server does not support Python 2!
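
One quick way to verify that your environment meets both requirements:

python -c "import sys, tensorflow as tf; print(sys.version, tf.__version__)"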

Command Line Interface

Once installed, you can start a BERT server from the command line:

bert-serving-start -model_dir /uncased_bert_model -num_worker 4

Server-side API

The server side is controlled via the CLI bert-serving-start; you can get the latest usage information via:

bert-serving-start --help

Start a BertServer for serving

usage: bert-serving-start [-h] -model_dir MODEL_DIR
                           [-tuned_model_dir TUNED_MODEL_DIR]
                           [-ckpt_name CKPT_NAME] [-config_name CONFIG_NAME]
                           [-graph_tmp_dir GRAPH_TMP_DIR]
                           [-max_seq_len MAX_SEQ_LEN] [-cased_tokenization]
                           [-pooling_layer POOLING_LAYER [POOLING_LAYER ...]]
                           [-pooling_strategy {NONE,REDUCE_MAX,REDUCE_MEAN,REDUCE_MEAN_MAX,FIRST_TOKEN,LAST_TOKEN,CLS_POOLED,CLASSIFICATION,REGRESSION}]
                           [-mask_cls_sep] [-no_special_token]
                           [-show_tokens_to_client] [-no_position_embeddings]
                           [-num_labels NUM_LABELS] [-port PORT]
                           [-port_out PORT_OUT] [-http_port HTTP_PORT]
                           [-http_max_connect HTTP_MAX_CONNECT] [-cors CORS]
                           [-num_worker NUM_WORKER]
                           [-max_batch_size MAX_BATCH_SIZE]
                           [-priority_batch_size PRIORITY_BATCH_SIZE] [-cpu]
                           [-xla] [-fp16]
                           [-gpu_memory_fraction GPU_MEMORY_FRACTION]
                           [-device_map DEVICE_MAP [DEVICE_MAP ...]]
                           [-prefetch_size PREFETCH_SIZE]
                           [-fixed_embed_length] [-verbose] [-version]

Named Arguments

-verbose

turn on TensorFlow logging for debugging

Default: False

-version

show program’s version number and exit

File Paths

configure the path, checkpoint and filename of a pretrained/fine-tuned BERT model

-model_dir

directory of a pretrained BERT model

-tuned_model_dir

directory of a fine-tuned BERT model

-ckpt_name

filename of the checkpoint file. By default it is “bert_model.ckpt”, but for a fine-tuned model the name could be different.

Default: “bert_model.ckpt”

-config_name

filename of the JSON config file for BERT model.

Default: “bert_config.json”

-graph_tmp_dir

path to the graph temp file

BERT Parameters

configure how the BERT model and pooling work

-max_seq_len

maximum length of a sequence; longer sequences will be trimmed on the right side. Set it to NONE to dynamically use the longest sequence in a (mini-)batch.

Default: 25
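
For example, to let the server pick up the longest sequence in each batch instead of a fixed cap (the model path is taken from the example above):

bert-serving-start -model_dir /uncased_bert_model -max_seq_len NONE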

-cased_tokenization

whether the tokenizer should skip the default lowercasing and accent removal. Should be used for, e.g., the multilingual cased pretrained BERT model.

Default: True

-pooling_layer

the encoder layer(s) on which pooling is applied. Give a list to concatenate several layers into one.

Default: [-2]

-pooling_strategy

Possible choices: NONE, REDUCE_MAX, REDUCE_MEAN, REDUCE_MEAN_MAX, FIRST_TOKEN, LAST_TOKEN, CLS_POOLED, CLASSIFICATION, REGRESSION

the pooling strategy for generating encoding vectors

Default: REDUCE_MEAN
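
For instance, to concatenate the last four encoder layers and mean-pool them into a single vector (model path illustrative):

bert-serving-start -model_dir /uncased_bert_model -pooling_layer -4 -3 -2 -1 -pooling_strategy REDUCE_MEAN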

-mask_cls_sep

mask the embeddings of [CLS] and [SEP] with zeros. When pooling_strategy is in {CLS_TOKEN, FIRST_TOKEN, SEP_TOKEN, LAST_TOKEN}, the embedding is preserved; otherwise it is masked to zero before pooling

Default: False

-no_special_token

by default, [CLS] and [SEP] are added to every sequence; when this flag is set and is_tokenized=True in the client, sequences are passed to the model without [CLS] and [SEP]

Default: False

-show_tokens_to_client

send tokenization results back to the client

Default: False

-no_position_embeddings

do not add position embeddings for the position of each token in the sequence

Default: False

-num_labels

number of labels

Default: 2
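
As a sketch, the flags in this group can be combined to serve a fine-tuned binary classifier; the tuned-model directory and checkpoint name below are placeholders:

# /tmp/my_finetuned_model and model.ckpt-12345 are placeholders
bert-serving-start -model_dir /uncased_bert_model -tuned_model_dir /tmp/my_finetuned_model -ckpt_name model.ckpt-12345 -num_labels 2 -pooling_strategy CLASSIFICATION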

Serving Configs

configure how the server utilizes GPU/CPU resources

-port, -port_in, -port_data

server port for receiving data from the client

Default: 5555

-port_out, -port_result

server port for sending results to the client

Default: 5556

-http_port

server port for receiving HTTP requests

-http_max_connect

maximum number of concurrent HTTP connections

Default: 10

-cors

set “Access-Control-Allow-Origin” for HTTP requests

Default: “*”
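
For example, to move the ZeroMQ ports off their defaults and additionally expose an HTTP endpoint (the port numbers here are arbitrary):

bert-serving-start -model_dir /uncased_bert_model -port 5557 -port_out 5558 -http_port 8125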

-num_worker

number of server instances

Default: 1

-max_batch_size

maximum number of sequences handled by each worker

Default: 256

-priority_batch_size

batches smaller than this size will be labeled as high priority and will jump forward in the job queue

Default: 16
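
For example, a two-worker setup that caps each batch at 128 sequences and fast-tracks batches of 8 or fewer:

bert-serving-start -model_dir /uncased_bert_model -num_worker 2 -max_batch_size 128 -priority_batch_size 8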

-cpu

run on CPU (by default the server runs on GPU)

Default: False

-xla

enable XLA compiler (experimental)

Default: False

-fp16

use float16 precision (experimental)

Default: False
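
Since both flags are experimental, it is safest to try them one at a time, e.g.:

bert-serving-start -model_dir /uncased_bert_model -fp16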

-gpu_memory_fraction

the fraction of the overall GPU memory that each worker may allocate per visible GPU. Should be in the range [0.0, 1.0]

Default: 0.5

-device_map

specify the list of GPU device ids that will be used (ids start from 0). If num_worker > len(device_map), devices will be reused; if num_worker < len(device_map), only device_map[:num_worker] will be used

Default: []
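
For example, to spread four workers over GPUs 0 and 1 (devices are reused because num_worker > len(device_map)), with each worker allowed half of the GPU memory:

bert-serving-start -model_dir /uncased_bert_model -num_worker 4 -device_map 0 1 -gpu_memory_fraction 0.5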

-prefetch_size

the number of batches to prefetch on each worker. When running on a CPU-only machine, this is set to 0 for compatibility

Default: 10

-fixed_embed_length

when max_seq_len is set to NONE, the server determines max_seq_len according to the actual sequence lengths within each batch. When pooling_strategy=NONE, this may cause two .encode() calls from the same client to return results of different sizes [B, T, D]. Turn this on to fix the “T” in [B, T, D] to “max_position_embeddings” from the BERT JSON config.

Default: False
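
For example, a sketch of serving token-level embeddings whose time dimension stays constant across .encode() calls:

bert-serving-start -model_dir /uncased_bert_model -pooling_strategy NONE -max_seq_len NONE -fixed_embed_length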

Server-side Benchmark

If you want to benchmark the speed, you may use:

bert-serving-benchmark --help

Benchmark BertServer locally

usage: bert-serving-benchmark [-h] -model_dir MODEL_DIR
                              [-tuned_model_dir TUNED_MODEL_DIR]
                              [-ckpt_name CKPT_NAME]
                              [-config_name CONFIG_NAME]
                              [-graph_tmp_dir GRAPH_TMP_DIR]
                              [-max_seq_len MAX_SEQ_LEN] [-cased_tokenization]
                              [-pooling_layer POOLING_LAYER [POOLING_LAYER ...]]
                              [-pooling_strategy {NONE,REDUCE_MAX,REDUCE_MEAN,REDUCE_MEAN_MAX,FIRST_TOKEN,LAST_TOKEN,CLS_POOLED,CLASSIFICATION,REGRESSION}]
                              [-mask_cls_sep] [-no_special_token]
                              [-show_tokens_to_client]
                              [-no_position_embeddings]
                              [-num_labels NUM_LABELS] [-port PORT]
                              [-port_out PORT_OUT] [-http_port HTTP_PORT]
                              [-http_max_connect HTTP_MAX_CONNECT]
                              [-cors CORS] [-num_worker NUM_WORKER]
                              [-max_batch_size MAX_BATCH_SIZE]
                              [-priority_batch_size PRIORITY_BATCH_SIZE]
                              [-cpu] [-xla] [-fp16]
                              [-gpu_memory_fraction GPU_MEMORY_FRACTION]
                              [-device_map DEVICE_MAP [DEVICE_MAP ...]]
                              [-prefetch_size PREFETCH_SIZE]
                              [-fixed_embed_length] [-verbose] [-version]
                              [-test_client_batch_size [TEST_CLIENT_BATCH_SIZE [TEST_CLIENT_BATCH_SIZE ...]]]
                              [-test_max_batch_size [TEST_MAX_BATCH_SIZE [TEST_MAX_BATCH_SIZE ...]]]
                              [-test_max_seq_len [TEST_MAX_SEQ_LEN [TEST_MAX_SEQ_LEN ...]]]
                              [-test_num_client [TEST_NUM_CLIENT [TEST_NUM_CLIENT ...]]]
                              [-test_pooling_layer [TEST_POOLING_LAYER [TEST_POOLING_LAYER ...]]]
                              [-wait_till_ready WAIT_TILL_READY]
                              [-client_vocab_file CLIENT_VOCAB_FILE]
                              [-num_repeat NUM_REPEAT]

Named Arguments

-verbose

turn on TensorFlow logging for debugging

Default: False

-version

show program’s version number and exit

File Paths

configure the path, checkpoint and filename of a pretrained/fine-tuned BERT model

-model_dir

directory of a pretrained BERT model

-tuned_model_dir

directory of a fine-tuned BERT model

-ckpt_name

filename of the checkpoint file. By default it is “bert_model.ckpt”, but for a fine-tuned model the name could be different.

Default: “bert_model.ckpt”

-config_name

filename of the JSON config file for BERT model.

Default: “bert_config.json”

-graph_tmp_dir

path to the graph temp file

BERT Parameters

configure how the BERT model and pooling work

-max_seq_len

maximum length of a sequence; longer sequences will be trimmed on the right side. Set it to NONE to dynamically use the longest sequence in a (mini-)batch.

Default: 25

-cased_tokenization

whether the tokenizer should skip the default lowercasing and accent removal. Should be used for, e.g., the multilingual cased pretrained BERT model.

Default: True

-pooling_layer

the encoder layer(s) on which pooling is applied. Give a list to concatenate several layers into one.

Default: [-2]

-pooling_strategy

Possible choices: NONE, REDUCE_MAX, REDUCE_MEAN, REDUCE_MEAN_MAX, FIRST_TOKEN, LAST_TOKEN, CLS_POOLED, CLASSIFICATION, REGRESSION

the pooling strategy for generating encoding vectors

Default: REDUCE_MEAN

-mask_cls_sep

mask the embeddings of [CLS] and [SEP] with zeros. When pooling_strategy is in {CLS_TOKEN, FIRST_TOKEN, SEP_TOKEN, LAST_TOKEN}, the embedding is preserved; otherwise it is masked to zero before pooling

Default: False

-no_special_token

by default, [CLS] and [SEP] are added to every sequence; when this flag is set and is_tokenized=True in the client, sequences are passed to the model without [CLS] and [SEP]

Default: False

-show_tokens_to_client

send tokenization results back to the client

Default: False

-no_position_embeddings

do not add position embeddings for the position of each token in the sequence

Default: False

-num_labels

number of labels

Default: 2

Serving Configs

configure how the server utilizes GPU/CPU resources

-port, -port_in, -port_data

server port for receiving data from the client

Default: 5555

-port_out, -port_result

server port for sending results to the client

Default: 5556

-http_port

server port for receiving HTTP requests

-http_max_connect

maximum number of concurrent HTTP connections

Default: 10

-cors

set “Access-Control-Allow-Origin” for HTTP requests

Default: “*”

-num_worker

number of server instances

Default: 1

-max_batch_size

maximum number of sequences handled by each worker

Default: 256

-priority_batch_size

batches smaller than this size will be labeled as high priority and will jump forward in the job queue

Default: 16

-cpu

run on CPU (by default the server runs on GPU)

Default: False

-xla

enable XLA compiler (experimental)

Default: False

-fp16

use float16 precision (experimental)

Default: False

-gpu_memory_fraction

the fraction of the overall GPU memory that each worker may allocate per visible GPU. Should be in the range [0.0, 1.0]

Default: 0.5

-device_map

specify the list of GPU device ids that will be used (ids start from 0). If num_worker > len(device_map), devices will be reused; if num_worker < len(device_map), only device_map[:num_worker] will be used

Default: []

-prefetch_size

the number of batches to prefetch on each worker. When running on a CPU-only machine, this is set to 0 for compatibility

Default: 10

-fixed_embed_length

when max_seq_len is set to NONE, the server determines max_seq_len according to the actual sequence lengths within each batch. When pooling_strategy=NONE, this may cause two .encode() calls from the same client to return results of different sizes [B, T, D]. Turn this on to fix the “T” in [B, T, D] to “max_position_embeddings” from the BERT JSON config.

Default: False

Benchmark parameters

config the experiments of the benchmark

-test_client_batch_size

Default: [1, 16, 256, 4096]

-test_max_batch_size

Default: [8, 32, 128, 512]

-test_max_seq_len

Default: [32, 64, 128, 256]

-test_num_client

Default: [1, 4, 16, 64]

-test_pooling_layer

Default: [[-1], [-2], [-3], [-4], [-5], [-6], [-7], [-8], [-9], [-10], [-11], [-12]]

-wait_till_ready

seconds to wait until the server is ready to serve

Default: 30

-client_vocab_file

file path for building the client vocabulary

Default: “README.md”

-num_repeat

number of repeats per experiment (must be >2); the first two results are discarded to account for warm-up effects

Default: 10
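
For example, to benchmark only the effect of the number of concurrent clients, with five repeats per experiment (model path illustrative):

bert-serving-benchmark -model_dir /uncased_bert_model -test_num_client 1 4 16 -num_repeat 5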