8 changes: 4 additions & 4 deletions docs/encoding.md
@@ -53,20 +53,20 @@ python -m tevatron.retriever.driver.encode \
```

> Here we are using our self-contained datasets to train.
> To use custom dataset, replace `--dataset_name Tevatron/wikipedia-nq-corpus` by
> `--encode_in_path <file to encode>`. (see here for details)
> To use custom dataset, set `--dataset_name json` and pass
> `--dataset_path <file to encode>`. (see here for details)

## Encoding on TPU (JAX / Flax)

[`tevatron.retriever.driver.jax_encode`](../src/tevatron/retriever/driver/jax_encode.py) is an optional JAX path with a **different** CLI than the PyTorch encoder above.

For example, the following command does the same thing as above, but with JAX/Flax:
```
python -m tevatron.driver.jax_encode \
python -m tevatron.retriever.driver.jax_encode \
--output_dir=temp \
--model_name_or_path model_nq \
--per_device_eval_batch_size 156 \
--passage_max_len 128 \
--dataset_name Tevatron/wikipedia-nq-corpus \
--encode_output_path corpus_emb.pkl
```
```
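
For the custom-dataset note on the PyTorch encoder earlier in this file, a minimal dry-run sketch may help. It assumes a local input file named `my_corpus.jsonl` and an output file named `my_corpus_emb.pkl` (both hypothetical) and reuses only flags that already appear in this document. The `run` helper is an illustration, not part of Tevatron: it prints the command unless `DRY_RUN=0` is set.

```bash
# Hypothetical paths: replace with your own input file and output location.
CORPUS_FILE="my_corpus.jsonl"
OUTPUT_FILE="my_corpus_emb.pkl"

# Print commands by default; set DRY_RUN=0 to execute them instead.
DRY_RUN="${DRY_RUN:-1}"

# Illustrative helper (not part of Tevatron): echo or run its arguments.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Encode a local JSON file with the PyTorch encoder.
run python -m tevatron.retriever.driver.encode \
  --output_dir temp \
  --model_name_or_path model_nq \
  --dataset_name json \
  --dataset_path "$CORPUS_FILE" \
  --encode_output_path "$OUTPUT_FILE"
```

Printing the command first makes it easy to check the flag names before committing to a long encoding run.
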
17 changes: 7 additions & 10 deletions docs/training.md
@@ -65,21 +65,21 @@ GradCache also works with multi-GPU `torchrun` setups.

## Training with TPU
Tevatron implements TPU training via Jax/Flax.
We provide a separate module `tevatron.driver.jax_train` to train on TPU.
We provide a separate module `tevatron.retriever.driver.jax_train` to train on TPU.
Its argument handling aligns with the PyTorch training driver above.

Running the following command on a v3-8 TPU VM is equivalent to the commands above.
```bash
python -m tevatron.driver.jax_train \
python -m tevatron.retriever.driver.jax_train \
--output_dir model_nq \
--dataset_name Tevatron/wikipedia-nq \
--model_name_or_path bert-base-uncased \
--do_train \
--per_device_train_batch_size 16 \
--train_group_size 2 \
--learning_rate 1e-5 \
--q_max_len 32 \
--p_max_len 156 \
--query_max_len 32 \
--passage_max_len 156 \
--num_train_epochs 40
```
> Note that our JAX training driver also supports gradient cache via the `--grad_cache` option.
@@ -98,16 +98,13 @@ Here we describe the details of the arguments additionally defined for Tevatron'
| `tokenizer_name` | Tokenizer name or path if not the same as `model_name_or_path` | `str` | same as `model_name_or_path` | pytorch, jax |
| `cache_dir` | Path to the directory to save the cache of models and datasets | `str` | `~/.cache/` | pytorch, jax |
| `untie_encoder` | Whether the query encoder and passage encoder use separate (untied) parameters | `bool` | `False` | pytorch, jax |
| `add_pooler` | Whether add pooler on top of last layer output | `bool` | `False` | pytorch |
| `projection_in_dim` | The input dim of pooler | `int` | `768` | |
| `projection_out_dim` | The output dim of pooler | `int` | `768` | pytorch |
| `dataset_name` | Name of a dataset available on HuggingFace | `str` | `json` | pytorch, jax |
| `train_dir` | Directory that stores custom training data | `str` | `None` | pytorch, jax |
| `dataset_proc_num` | Number of threads to use to preprocess/tokenize data | `int` | `12` | pytorch, jax |
| `dataset_path` | Path to local data files or directory | `str` | `None` | pytorch, jax |
| `num_proc` | Number of threads to use to preprocess/tokenize data | `int` | `1` | pytorch, jax |
| `train_group_size` | Number of passages for each anchor query during training. It will load 1 positive passage + (`train_group_size`-1) negative passages for each example during training | `int` | `8` | pytorch, jax |
| `passage_field_separator` | The token used to separate the `title` and `text` fields of passages | `str` | `" "` | pytorch |
| `query_max_len` | Maximum query length | `int` | `32` | pytorch, jax |
| `passage_max_len` | Maximum passage length | `int` | `128` | pytorch, jax |
| `grad_cache` | Whether to use the gradient cache feature. This supports large batch sizes when GPU/TPU memory is limited. | `bool` | `False` | pytorch, jax |
| `gc_q_chunk_size` | Sub-batch size for queries with `grad_cache` | `int` | `4` | pytorch, jax |
| `gc_p_chunk_size` | Sub-batch size for passages with `grad_cache` | `int` | `32` | pytorch, jax |
| `gc_p_chunk_size` | Sub-batch size for passages with `grad_cache` | `int` | `32` | pytorch, jax |
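
To make the `train_group_size` row concrete, the sketch below works out how many passages one device encodes per step for the TPU command shown earlier (`--per_device_train_batch_size 16`, `--train_group_size 2`). It relies on the table's statement that each query loads one positive plus `train_group_size - 1` negatives, and it assumes the per-device batch size counts queries. The variable names are illustrative only.

```bash
PER_DEVICE_TRAIN_BATCH_SIZE=16   # assumed to count queries per device per step
TRAIN_GROUP_SIZE=2               # 1 positive + (TRAIN_GROUP_SIZE - 1) negatives per query

NEGATIVES_PER_QUERY=$((TRAIN_GROUP_SIZE - 1))
PASSAGES_PER_DEVICE=$((PER_DEVICE_TRAIN_BATCH_SIZE * TRAIN_GROUP_SIZE))

echo "negatives per query: $NEGATIVES_PER_QUERY"
echo "passages per device per step: $PASSAGES_PER_DEVICE"
```

Under these assumptions, raising `train_group_size` grows the passage side of the batch linearly, which is the situation the `grad_cache` options in the table are meant to help with.
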
110 changes: 59 additions & 51 deletions examples/coCondenser-marco/README.md
@@ -27,31 +27,33 @@ mkdir -p encoding/corpus
mkdir -p encoding/query
for i in $(seq -f "%02g" 0 9)
do
python -m tevatron.driver.encode \
python -m tevatron.retriever.driver.encode \
--output_dir ./retriever_model \
--model_name_or_path Luyu/co-condenser-marco-retriever \
--fp16 \
--per_device_eval_batch_size 128 \
--encode_in_path marco/bert/corpus/split${i}.json \
--encoded_save_path encoding/corpus/split${i}.pt
--dataset_name json \
--dataset_path marco/bert/corpus/split${i}.json \
--encode_output_path encoding/corpus/split${i}.pt
done


python -m tevatron.driver.encode \
python -m tevatron.retriever.driver.encode \
--output_dir ./retriever_model \
--model_name_or_path Luyu/co-condenser-marco-retriever \
--fp16 \
--q_max_len 32 \
--encode_is_qry \
--query_max_len 32 \
--encode_is_query \
--per_device_eval_batch_size 128 \
--encode_in_path marco/bert/query/dev.query.json \
--encoded_save_path encoding/query/qry.pt
--dataset_name json \
--dataset_path marco/bert/query/dev.query.json \
--encode_output_path encoding/query/qry.pt
```
### Index Search
```
python -m tevatron.faiss_retriever \
--query_reps encoding/query/qry.pt \
--passage_reps encoding/corpus/'*.pt' \
python -m tevatron.retriever.driver.search \
--query_reps encoding/query/qry.pt \
--passage_reps encoding/corpus/'*.pt' \
--depth 10 \
--batch_size -1 \
--save_text \
@@ -65,15 +67,16 @@ python ../msmarco-passage-ranking/score_to_marco.py rank.txt
Pick a pre-trained condenser that is most suitable for the experiment from [Condenser Repo](https://github.com/luyug/Condenser#pre-trained-models).
Train
```
python -m tevatron.driver.train \
--output_dir ./retriever_model_s1 \
--model_name_or_path CONDENSER_MODEL_NAME \
--save_steps 20000 \
--train_dir ./marco/bert/train \
--fp16 \
--per_device_train_batch_size 8 \
--learning_rate 5e-6 \
--num_train_epochs 3 \
python -m tevatron.retriever.driver.train \
--output_dir ./retriever_model_s1 \
--model_name_or_path CONDENSER_MODEL_NAME \
--save_steps 20000 \
--dataset_name json \
--dataset_path "./marco/bert/train/*.json" \
--fp16 \
--per_device_train_batch_size 8 \
--learning_rate 5e-6 \
--num_train_epochs 3 \
--dataloader_num_workers 2
```
## Mining Hard Negatives
@@ -84,31 +87,33 @@ mkdir -p encoding/corpus
mkdir -p encoding/query
for i in $(seq -f "%02g" 0 9)
do
python -m tevatron.driver.encode \
python -m tevatron.retriever.driver.encode \
--output_dir ./retriever_model \
--model_name_or_path ./retriever_model_s1 \
--fp16 \
--per_device_eval_batch_size 128 \
--encode_in_path marco/bert/corpus/split${i}.json \
--encoded_save_path encoding/corpus/split${i}.pt
--dataset_name json \
--dataset_path marco/bert/corpus/split${i}.json \
--encode_output_path encoding/corpus/split${i}.pt
done

python -m tevatron.driver.encode \
python -m tevatron.retriever.driver.encode \
--output_dir ./retriever_model \
--model_name_or_path ./retriever_model_s1 \
--fp16 \
--q_max_len 32 \
--encode_is_qry \
--query_max_len 32 \
--encode_is_query \
--per_device_eval_batch_size 128 \
--encode_in_path marco/bert/query/train.query.json \
--encoded_save_path encoding/query/train.pt
--dataset_name json \
--dataset_path marco/bert/query/train.query.json \
--encode_output_path encoding/query/train.pt
```

### Search
```
python -m tevatron.faiss_retriever \
--query_reps encoding/query/train.pt \
--passage_reps encoding/corpus/'*.pt' \
python -m tevatron.retriever.driver.search \
--query_reps encoding/query/train.pt \
--passage_reps encoding/corpus/'*.pt' \
--batch_size 5000 \
--save_text \
--save_ranking_to train.rank.tsv
@@ -121,15 +126,16 @@ bash create_hn.sh

## Fine-tuning Stage 2
```
python -m tevatron.driver.train \
--output_dir ./retriever_model_s2 \
--model_name_or_path CONDENSER_MODEL_NAME \
--save_steps 20000 \
--train_dir ./marco/bert/train-hn \
--fp16 \
--per_device_train_batch_size 8 \
--learning_rate 5e-6 \
--num_train_epochs 2 \
python -m tevatron.retriever.driver.train \
--output_dir ./retriever_model_s2 \
--model_name_or_path CONDENSER_MODEL_NAME \
--save_steps 20000 \
--dataset_name json \
--dataset_path "./marco/bert/train-hn/*.json" \
--fp16 \
--per_device_train_batch_size 8 \
--learning_rate 5e-6 \
--num_train_epochs 2 \
--dataloader_num_workers 2
```

@@ -140,30 +146,32 @@ mkdir -p encoding/corpus-s2
mkdir -p encoding/query-s2
for i in $(seq -f "%02g" 0 9)
do
python -m tevatron.driver.encode \
python -m tevatron.retriever.driver.encode \
--output_dir ./retriever_model_s2 \
--model_name_or_path ./retriever_model_s2 \
--fp16 \
--per_device_eval_batch_size 128 \
--encode_in_path marco/bert/corpus/split${i}.json \
--encoded_save_path encoding/corpus-s2/split${i}.pt
--dataset_name json \
--dataset_path marco/bert/corpus/split${i}.json \
--encode_output_path encoding/corpus-s2/split${i}.pt
done

python -m tevatron.driver.encode \
python -m tevatron.retriever.driver.encode \
--output_dir ./retriever_model_s2 \
--model_name_or_path ./retriever_model_s2 \
--fp16 \
--q_max_len 32 \
--encode_is_qry \
--query_max_len 32 \
--encode_is_query \
--per_device_eval_batch_size 128 \
--encode_in_path marco/bert/query/dev.query.json \
--encoded_save_path encoding/query-s2/qry.pt
--dataset_name json \
--dataset_path marco/bert/query/dev.query.json \
--encode_output_path encoding/query-s2/qry.pt
```
Run the retriever,
```
python -m tevatron.faiss_retriever \
--query_reps encoding/query-s2/qry.pt \
--passage_reps encoding/corpus-s2/'*.pt' \
python -m tevatron.retriever.driver.search \
--query_reps encoding/query-s2/qry.pt \
--passage_reps encoding/corpus-s2/'*.pt' \
--depth 10 \
--batch_size -1 \
--save_text \
51 changes: 26 additions & 25 deletions examples/coCondenser-nq/README.md
@@ -45,20 +45,20 @@ python prepare_wiki_train.py --input hn.json --output nq-train/hn.bert.json --to
Pick a pre-trained condenser that is most suitable for the experiment from [Condenser Repo](https://github.com/luyug/Condenser#pre-trained-models).
Run training,
```
python -m torch.distributed.launch --nproc_per_node=4 -m tevatron.driver.train \
python -m torch.distributed.launch --nproc_per_node=4 -m tevatron.retriever.driver.train \
--output_dir model-nq \
--model_name_or_path CONDENSER_MODEL_NAME \
--do_train \
--save_steps 20000 \
--train_dir nq-train \
--dataset_name json \
--dataset_path "nq-train/*.json" \
--fp16 \
--per_device_train_batch_size 32 \
--train_n_passages 2 \
--train_group_size 2 \
--learning_rate 5e-6 \
--q_max_len 32 \
--p_max_len 256 \
--query_max_len 32 \
--passage_max_len 256 \
--num_train_epochs 40 \
--negatives_x_device \
--untie_encoder \
--positive_passage_no_shuffle
```
@@ -84,20 +84,20 @@ python prepare_wiki_train.py --input hn.json --output nq-train/hn.bert.json --to
Pick a pre-trained condenser that is most suitable for the experiment from [Condenser Repo](https://github.com/luyug/Condenser#pre-trained-models).
Run training,
```
python -m torch.distributed.launch --nproc_per_node=4 -m tevatron.driver.train \
python -m torch.distributed.launch --nproc_per_node=4 -m tevatron.retriever.driver.train \
--output_dir model-nq \
--model_name_or_path CONDENSER_MODEL_NAME \
--do_train \
--save_steps 20000 \
--train_dir nq-train \
--dataset_name json \
--dataset_path "nq-train/*.json" \
--fp16 \
--per_device_train_batch_size 32 \
--train_n_passages 2 \
--train_group_size 2 \
--learning_rate 5e-6 \
--q_max_len 32 \
--p_max_len 256 \
--query_max_len 32 \
--passage_max_len 256 \
--num_train_epochs 20 \
--negatives_x_device \
--untie_encoder \
--positive_passage_no_shuffle
```
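
The renames in the command above (module path, `--train_n_passages`, `--q_max_len`, `--p_max_len`) recur across these examples, so an existing script can be updated mostly mechanically. The following is a rough sketch of a `sed` filter covering only the one-to-one renames visible in this diff. `migrate_flags` is a hypothetical helper name, not a Tevatron tool. It does not handle `--train_dir`, which becomes `--dataset_name json` plus a `--dataset_path` glob, or dataset names with a split suffix such as `Tevatron/wikipedia-nq/test`, which become `--dataset_name` plus `--dataset_split`.

```bash
# Hypothetical helper: rewrite old module and flag names read from stdin.
migrate_flags() {
  sed \
    -e 's/tevatron\.driver\./tevatron.retriever.driver./g' \
    -e 's/tevatron\.faiss_retriever/tevatron.retriever.driver.search/g' \
    -e 's/--q_max_len/--query_max_len/g' \
    -e 's/--p_max_len/--passage_max_len/g' \
    -e 's/--encode_is_qry/--encode_is_query/g' \
    -e 's/--encoded_save_path/--encode_output_path/g' \
    -e 's/--encode_in_path/--dataset_name json --dataset_path/g' \
    -e 's/--dataset_proc_num/--num_proc/g' \
    -e 's/--encode_num_shard/--dataset_number_of_shards/g' \
    -e 's/--encode_shard_index/--dataset_shard_index/g' \
    -e 's/--train_n_passages/--train_group_size/g'
}

# Example: preview the rewritten form of one old-style command line.
printf '%s\n' 'python -m tevatron.driver.encode --q_max_len 32 --encode_is_qry' | migrate_flags
```

Because none of the new names contain an old name, running the filter twice leaves already-migrated text unchanged. Review the output by hand before overwriting a script.
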
@@ -111,32 +111,33 @@ MODEL_DIR=nq-model

for s in $(seq -f "%02g" 0 19)
do
python -m tevatron.driver.encode \
python -m tevatron.retriever.driver.encode \
--config_name CONDENSER_MODEL_NAME \
--output_dir=$OUTDIR \
--model_name_or_path $MODEL_DIR \
--fp16 \
--per_device_eval_batch_size 64 \
--p_max_len 256 \
--dataset_proc_num 8 \
--passage_max_len 256 \
--num_proc 8 \
--dataset_name Tevatron/wikipedia-nq-corpus \
--encoded_save_path embeddings-nq/$s.pt \
--encode_num_shard 20 \
--encode_output_path embeddings-nq/$s.pt \
--dataset_number_of_shards 20 \
--passage_field_separator sep_token \
--encode_shard_index $s
--dataset_shard_index $s
done

python -m tevatron.driver.encode \
python -m tevatron.retriever.driver.encode \
--output_dir=$OUTDIR \
--model_name_or_path $MODEL_DIR \
--config_name CONDENSER_MODEL_NAME \
--fp16 \
--per_device_eval_batch_size 64 \
--q_max_len 32 \
--dataset_proc_num 2 \
--dataset_name Tevatron/wikipedia-nq/test \
--encoded_save_path embeddings-nq-queries/query.pt \
--encode_is_qry
--query_max_len 32 \
--num_proc 2 \
--dataset_name Tevatron/wikipedia-nq \
--dataset_split test \
--encode_output_path embeddings-nq-queries/query.pt \
--encode_is_query
```

## Search and Evaluation
@@ -146,7 +147,7 @@ ENCODE_QRY_DIR=embeddings-nq-queries
ENCODE_DIR=embeddings-nq
DEPTH=200
RUN=run.nq.test.txt
python -m tevatron.faiss_retriever \
python -m tevatron.retriever.driver.search \
--query_reps $ENCODE_QRY_DIR/query.pt \
--passage_reps $ENCODE_DIR/'*.pt' \
--depth $DEPTH \
11 changes: 6 additions & 5 deletions examples/example_dpr.md
@@ -59,9 +59,9 @@ python -m tevatron.retriever.driver.encode \
--fp16 \
--per_device_eval_batch_size 156 \
--dataset_name Tevatron/wikipedia-nq-corpus \
--encoded_save_path corpus_emb.$s.pkl \
--encode_num_shard 20 \
--encode_shard_index $s
--encode_output_path corpus_emb.$s.pkl \
--dataset_number_of_shards 20 \
--dataset_shard_index $s
done
```
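
Before searching, it can be worth confirming that every shard was written. The sketch below is one way to do that. It assumes the loop variable `$s` takes the zero-padded values `00` through `19`; the loop header is not shown in this hunk, so the `%02d` pattern is an assumption to adjust. `missing_shards` is a hypothetical helper, not part of Tevatron.

```bash
# Hypothetical helper: print every expected shard file that does not exist.
#   $1: number of shards
#   $2: printf-style pattern with one integer placeholder for the shard index
missing_shards() {
  n="$1"
  pattern="$2"
  i=0
  while [ "$i" -lt "$n" ]; do
    # The pattern is deliberately used as the printf format string.
    f=$(printf "$pattern" "$i")
    [ -f "$f" ] || echo "$f"
    i=$((i + 1))
  done
}

# Lists any of corpus_emb.00.pkl ... corpus_emb.19.pkl that are missing.
missing_shards 20 'corpus_emb.%02d.pkl'
```

Empty output means all expected files are present; any printed path points to a shard that needs to be encoded again.
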

@@ -72,8 +72,9 @@ python -m tevatron.retriever.driver.encode \
--model_name_or_path model_nq \
--fp16 \
--per_device_eval_batch_size 156 \
--dataset_name Tevatron/wikipedia-nq/test \
--encoded_save_path query_emb.pkl \
--dataset_name Tevatron/wikipedia-nq \
--dataset_split test \
--encode_output_path query_emb.pkl \
--encode_is_query
```
