Today we're targeting our Renewal Date Extractor: when a speaker mentions an upcoming renewal date, we would like to extract it. Most likely there is no date, and we need to take that into account.
We’ll cover some examples and then present our current prompt crafted by our fearless CEO: Masha Krol!
As usual, please let us know any tips and tricks you might have for this particular task!
| Snippet | Meeting Date | Answer |
|---|---|---|
| So our renewal is coming up on December 1st. | 2023-02-01 | 2023-12-01 |
| As we've discussed, our last day with JetBrains is next month. | 2022-09-01 | 2022-10-01 |
| Our contract is coming up on the 16th of January I think. | 2023-11-04 | 2024-01-16 |
| I don't think I know the renewal date for Furia. | 2023-04-29 | NA |
Please note that the Agent must output "NA" when there is no answer. This has proven difficult to handle with common techniques like regexes or even more advanced NLP methods.
Moreover, customers express their renewals in such a variety of ways that using LLMs is the perfect choice for a quick implementation.
After many iterations, we settled on this Agent, crafted by Masha Krol:
Renewal Date Extractor
Prompt:
Your task is to extract the renewal date for a product from a small portion of a transcribed meeting recording.
You should output the date of the renewal, taking into account the date of the meeting.
If you cannot find the exact day of renewal, output the first of that month. You must output "NA" when you do not know the answer.
Example:
Input: "Our Microsoft contract is up for renewal in June", the meeting date is February 7, 2024.
Answer: June 1, 2024
Input: "Our Apple contract is up for renewal in June", the meeting date is April 2, 2023.
Answer: June 1, 2023
Model: GPT-4
Run Instruction: the meeting date is {meetingDate}
Note that we do a bit of supervision by providing examples. After our evaluation, we found this approach results in better performance in almost all cases. It also allows us to specify the expected output format, which we can then parse using `datetime.strptime`.
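As an illustration, here is a minimal sketch of how the Agent's answer could be parsed downstream; the helper name and the exact output format are my assumptions based on the examples above, not part of the prompt itself.

from datetime import datetime

def parse_renewal_answer(answer: str):
    """Parse the Agent's answer (e.g. "June 1, 2024") into a date, or None for "NA"."""
    answer = answer.strip()
    if answer.upper() == "NA":
        return None
    # Assumes the model follows the "June 1, 2024" format shown in the examples.
    return datetime.strptime(answer, "%B %d, %Y").date()

print(parse_renewal_answer("June 1, 2024"))  # 2024-06-01
print(parse_renewal_answer("NA"))            # None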
If you've made it this far, let me know in the comments and tell me about common failure cases you see in this task!
See you next time,
Fred
Today we will cover the prompt we've shown on the Glowstick website. This prompt is the result of a group effort where everyone contributed; you can learn more by reading our blog. Again, congrats to Shruti Gupta, our designer, who crafted this prompt!
TLDR
Our summarizing prompt for sales opportunities, crafted by Shruti Gupta:
You will read a portion of a longer conversation between a Sales and Customer success team and their customers. Summarize the main ideas from the snapshot of the conversation into a short summary not more than 10 words. Start the summary with the 'Customer name is'. If mentioned, highlight any interest in a product or service.
The run instructions are formatted as: `The customer is {customer name}, the product is {product name}`.
Glowstick is a platform that raises sales opportunities. Our task is to generate a headline from the snippet of conversation we detected as an opportunity. We have access to the conversation, the customer name and the product targeted by this opportunity.
Examples
Concerns
In this project, our primary concerns were:
1. Accuracy
2. Conciseness
3. Headlines should target a specific product
Evaluation
To evaluate ourselves against our previous iteration, we ran a voting session at Glowstick and this agent won 70% of the time across our test set against 5 other agents.
We use the GPT-4 Assistant API with the following prompt and run instructions:
You will read a portion of a longer conversation between a Sales and Customer success team and their customers. Summarize the main ideas from the snapshot of the conversation into a short summary not more than 10 words. Start the summary with the 'Customer name is'. If mentioned, highlight any interest in a product or service.
Run Instructions:
`The customer is {customer name}, the product is {product name}`
`customer_name` is the name of this customer (BlueSky in the examples above), and `product_name` is the name of the product (Webhooks, Multi-Language in the examples).
Analysis
A key item of our prompt is `not more than 10 words`. Depending on the version of GPT you're using, the LLM can ignore this part and output a full paragraph. We also observed that in some cases, the LLM would continue the conversation, ignoring the instruction altogether.
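As a safety net, a simple post-processing step can truncate outputs that blow past the limit. This is my own sketch, not part of Shruti's prompt work:

def enforce_word_limit(summary: str, max_words: int = 10) -> str:
    # Truncate overly long summaries when the model ignores the
    # "not more than 10 words" instruction.
    words = summary.strip().split()
    return " ".join(words[:max_words])

print(enforce_word_limit("Customer name is BlueSky and they are very interested in the Webhooks integration"))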
Finally, we observed that this prompt would rarely ignore `product_name`, even if it was wrong. Thus, our accuracy in product matching is more important than ever, which triggered a new iteration of this system (that I will describe in a future post if there's interest).
Now that we’re pretty happy with the state of sales opportunities summarization, we want to target new types of insights such as churn risks, CSQLs or even competitor mentions.
Tell us about your best summarization prompt, and if you've tried our prompt, let us know in the comments!
We would be eager to create a dataset for this particular task if people are up for it. Contact me on LinkedIn/Email.
You can learn more about what we do at glowstick.ai.
My journey into the realm of ML began during the rise of Deep Learning. I immersed myself in learning Theano on deeplearning.net, and during that time, AlexNet reigned as the ultimate model for ImageNet. However, it was an interview at ElementAI that completely revolutionized my perspective on ML.
TLDR: Our expertise lies in using the right tool that brings value to your users, whether it’s an LLM or a regex.
During the interview, I was tasked with developing a system for GIF categorization for a well-known company that rhymes with Jiphy. My initial approach involved computing embeddings for all frames and using clustering techniques to assign categories to the GIFs. While this solution was deemed superior to those proposed by other interviewees, who wanted to train models on a limited set of predetermined categories, I realized that simplicity could be our ally.
After careful consideration, I decided to focus on the tags associated with the GIFs. Most GIFs come with hashtags that describe their content. For example, if a GIF has the hashtag #funny, the odds are higher that it is a funny GIF. This simple insight turned out to be the real solution.
This experience taught me a valuable lesson: as MLE professionals, our ultimate goal is to deliver features to customers as quickly as possible. It is better to have an okay-ish version that works reasonably well than to spend months developing a model that may never be used at all.
Now, let’s circle back to LLMs. Of course, we will continue to employ techniques such as KNNs, Sentence Embeddings, and trained classifiers. However, if we can create a semi-reliable prompt using an LLM in just one afternoon, it becomes the ideal way of showcasing new features to our users. This approach allows us to iterate more frequently and gain a better understanding of our customers’ needs.
At Glowstick, our ML pipeline is a fusion of various components, including Similarity search with SBERT, trained classifiers, regexes, and LLMs. We leverage what makes the most sense at any given time. In a startup environment, it’s highly likely that we will discard models along the way, but that’s just part of the journey. :)
In the upcoming weeks, I will share examples of simple cases where we opted to use LLMs instead of training our own models. I will delve into our rationale and discuss the tradeoffs we observed. And of course, I'll include exciting prompts to illustrate our findings! Let me know in the comments if this interests you!
Stay tuned for more intriguing insights!
One of the worst experiences for a user is when a model makes a wrong prediction with high confidence: they stop trusting the model because its confidence does not match its actual performance.
This is what we call calibration. We can measure how well a model is calibrated using the Expected Calibration Error, or ECE. This metric can be summarized as the weighted average of the difference between a model's confidence and its accuracy across multiple confidence bins. Below, we have a visual explanation coming from the excellent paper of Guo et al. (2017).
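In symbols, the standard definition from that paper, where $B_m$ is the set of predictions whose confidence falls into bin $m$ (out of $M$ bins) and $n$ is the total number of samples:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}\,\left|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\right|$$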
We want to minimize the gaps in this diagram. In this post, we will improve a model’s calibration using HuggingFace and BaaL. The notebook is available here.
The HuggingFace ecosystem is simple to use and in just a few lines of code, we can have a pre-trained model and its associated dataset. We will use the well-known SST2 dataset along with a DistilBERT model.
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, TextClassificationPipeline, AutoModelForSequenceClassification

name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)
pipeline: TextClassificationPipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=-1)  # CPU
dataset: Dataset = load_dataset("glue", "sst2")["validation"]
BaaL is a Bayesian active learning library that will help us improve ECE.
To do so, we will use Bayesian deep learning to gather multiple predictions for the same input. The key idea is that by drawing multiple sets of weights from the posterior distribution, the average prediction will be better than a single one. This is not unlike ensembles, but without retraining; we will call this a Bayesian ensemble. Generally, ensembles are better but require more computational power.
While we have ways to separate the model's uncertainty from the data's uncertainty, we will focus on the predictive uncertainty, which is ultimately what affects the calibration of the model.
Next, we will compare the regular model’s ECE and its Bayesian alternative. BaaL will help us prepare the model and compute the ECE.
To prepare the model, we simply do:
from baal.bayesian.dropout import patch_module
pipeline.model = patch_module(pipeline.model)
This will modify the model of our loaded pipeline to use Dropout at test time. We now run the model 20 times and compute the average prediction before computing the ECE and the Accuracy. Below, we show the difference between both approaches.
| | Bayesian | Frequentist |
|---|---|---|
| ECE | 0.063 | 0.0802 |
| Accuracy | 0.903 | 0.910 |
Using 20 iterations, we improved our model’s calibration by a significant margin. This is quite good!
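To make the Bayesian-ensemble step concrete, here is a rough standalone sketch of the MC prediction and ECE computation. The post itself relies on BaaL's utilities for the ECE; the tokenization details below are my own assumptions.

import numpy as np
import torch

def mc_predictions(model, tokenizer, texts, iterations=20):
    # With Dropout patched to stay active (patch_module above), every forward
    # pass samples a different set of weights; we average the softmax outputs.
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        runs = [torch.softmax(model(**enc).logits, dim=-1) for _ in range(iterations)]
    return torch.stack(runs).mean(dim=0).numpy()

def expected_calibration_error(probs, labels, n_bins=10):
    # Weighted average of |accuracy - confidence| over equal-width confidence bins.
    conf, pred = probs.max(axis=1), probs.argmax(axis=1)
    correct = (pred == np.asarray(labels)).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece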
Let's investigate how the number of iterations affects the ECE.
Testing our ECE at multiple iterations, we see that it converges quickly after ~40 iterations. While the accuracy takes a hit in the beginning, it quickly comes back. Of course, sampling brings noise to the prediction, but it stabilizes quickly with enough iterations.
Using a couple of lines of code, we can improve our model’s calibration. While we now require multiple predictions per input, the cost should not be too prohibitive for most cases. If you have access to large GPUs, I suggest duplicating your dataset and aggregating the predictions at the end.
I did this analysis on an academic dataset where Bayesian deep learning has been extensively studied. In my next blog post, I will analyze a dataset closer to real data: CLINC.
I have gone quickly over Bayesian deep learning and MC-Dropout so here are some resources if you want to know more:
Earlier, I mentioned model uncertainty versus data uncertainty, if you would like to know more I would recommend the following resources:
If you have any questions or suggestions, please contact me at:
I’m thinking of more blog posts combining HuggingFace and BaaL, let me know if that interests you!
The code is available in this gist.
Instead of building a pipeline of Sequences (like in Keras-Transform), the data augmentation will be done inside one. So we should get something like this :
class MySequence(keras.utils.Sequence):
    def __init__(self):
        ...

    def __getitem__(self, idx):
        X, y = ...  # Get the inputs
        for i in range(len(X)):
            rotation = get_random_rotation()
            # Apply the same transformation to X and y
            X[i] = rotate_image(X[i], rotation)
            y[i] = rotate_image(y[i], rotation)
        return X, y
Thanks to vkk800, this API has made it into Keras and is now released. Here’s a simple example.
We first import OpenCV, Keras' Sequence, and the updated ImageDataGenerator:
import cv2
import numpy as np
from keras.utils import Sequence
from keras.preprocessing.image import ImageDataGenerator
We only need to inherit from Sequence. Since this is an example, it will only return a batch of cats.
class MySequence(Sequence):
def __init__(self):
self.path = '~Images/cat.jpg'
self.imgaug = ImageDataGenerator(rotation_range=20,
rescale=1/255.,
width_shift_range=10)
def __len__(self):
return 10
def __getitem__(self, idx):
X = np.array([cv2.resize(cv2.imread(self.path), (100, 100))
for _ in range(10)]).astype(np.float32) # Fake batch of cats
y = np.copy(X)
for i in range(len(X)):
# This creates a dictionary with the params
params = self.imgaug.get_random_transform(X[i].shape)
# We can now deterministically augment all the images
X[i] = self.imgaug.apply_transform(self.imgaug.standardize(X[i]), params)
y[i] = self.imgaug.apply_transform(self.imgaug.standardize(y[i]), params)
return X, y
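As a quick illustration of how this Sequence could be consumed for training (the model here is hypothetical and not part of the example):

# Hypothetical usage: feed the augmented Sequence to an already-compiled model
# that maps 100x100x3 images to 100x100x3 images.
seq = MySequence()
model.fit_generator(seq, steps_per_epoch=len(seq), epochs=5,
                    use_multiprocessing=True, workers=4)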
Using this `Sequence`, we can get the following results (`X` is on the top row and `y` on the second).
Making this work for a real-life application is really easy! For example, we can use the `params` variable in other situations like bounding box augmentation.
The code is available in this gist.
Thank you for reading, and I'm always available on Keras' Slack (@Dref360) if you have any questions!
Cheers, Frédéric Branchaud-Charron
The problem is that TensorFlow Sessions cannot be shared across processes. So you cannot have a web server that holds a reference to your model and calls methods on it. Also, Keras will not tell you what's going on, because it doesn't know. In fact, TensorFlow will just block without error.
Your model should only be used by a single worker. Easy enough?
Let’s first create a class that will handle our model. It should be thread-safe, because multiple workers will be accessing it.
from keras.applications import VGG16
from multiprocessing import Lock
import numpy as np
class KerasModel():
def __init__(self):
self.mutex = Lock()
self.model = None
def initialize(self):
"""Initialize our model"""
self.model = VGG16()
# Dummy compile
self.model.compile('sgd', 'mse')
def predict(self, arr):
"""This method uses VGG16 to predict an ImageNet class.
arr: Numpy array, the input image (should be of shape (224,224,3))
returns : A distribution over all 1000 ImageNet's classes.
"""
if arr.shape != (224, 224, 3):
raise ValueError('The image provided is not right.')
with self.mutex:
# With the mutex, we can now predict!
return self.model.predict_on_batch(arr[np.newaxis, ...])[0]
I'll use a `multiprocessing.Manager` to interact with our model. A Manager spawns in its own process, and its purpose is to do exactly this kind of thing.
Basically, a Manager is a simple server that provides methods to access its components. In our case, our model is our component.
To let the Manager use our model, we’ll need to register it.
from multiprocessing.managers import BaseManager
# Dummy class
class KerasManager(BaseManager):
pass
KerasManager.register('KerasModel', KerasModel)
That’s it!
We can now build our web server! I’ll use Flask, because it’s super easy to use, but choose the framework of your choice.
Let's initialize our manager:
from keras_model import KerasManager
manager = KerasManager()
manager.start() # Important to start our server!
keras_model = manager.KerasModel()
keras_model.initialize() # Important to initialize our network!
It is now ready to be used inside the web server.
import cv2
import numpy as np
from flask import Flask, request
app = Flask(__name__)
@app.route('/hello', methods=['POST'])
def hello():
img = cv2.imdecode(np.fromstring(request.files['file'].read(), np.uint8), -1)
img = cv2.resize(img, (224, 224))
# With the Manager, we call our model!
pred = keras_model.predict(img)
# Get the associated ImageNet class
return imagenet_classes[np.argmax(pred)]
if __name__ == '__main__':
app.run()
Where `imagenet_classes` is the dict that we can find here.
Our web server has only one endpoint, hello, which is a POST method that accepts a file with the key 'file'.
The file is then read using OpenCV. We then feed the network and send back the response.
We can now try it with your favorite request client (like Postman).
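Or, from Python with the requests library; the image path is a placeholder and the port is Flask's default:

import requests

# Assumes the Flask app above is running locally on the default port 5000.
with open("cat.jpg", "rb") as f:  # any test image on disk
    resp = requests.post("http://localhost:5000/hello", files={"file": f})
print(resp.text)  # the predicted ImageNet class name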
As we can see, it’s fairly easy to use a Keras model inside a Web server!
The code can be found on my repo: https://github.com/Dref360/tuto_keras_web
`create_model` and `get_paths` should be created by you.
You will need Keras, Keras-Transform and OpenCV.
The quickest possible way to install OpenCV is through conda:
conda install -c menpo opencv3
Let's begin by importing Keras, Keras-Transform and OpenCV. We'll design our Sequence object in the simplest way. It will be provided with a list of paths to the input images and their segmentation masks. The Sequence will load the images and resize them.
import cv2
import numpy as np
from keras.utils import Sequence
from model import create_model # Function to create a model which does segmentation (U-Net, Tiramisu, etc)
from transform.sequences import (RandomRotationTransformer, SequentialTransformer,
RandomZoomTransformer, RandomHorizontalFlipTransformer)
from your_dataset import get_paths # Method which will return the path to both the image and its segmentation mask
INPUT_SHAPE = (300, 300)
BATCH_SIZE = 10
class YourDatasetSequence(Sequence):
    def __init__(self, paths: [(str, str)], input_shape: (int, int), batch_size: int):
        self.paths = paths
        self.input_shape = input_shape  # We assume that the input_shape is the same as the output
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch
        return int(np.ceil(len(self.paths) / self.batch_size))

    def __getitem__(self, idx: int):
        data = self.paths[idx * self.batch_size:(idx + 1) * self.batch_size]
        X, mask = zip(*data)  # Split the data between input and groundtruth
        # Load and resize the images, we also apply some preprocessing.
        X = [cv2.resize(cv2.imread(fi), self.input_shape) for fi in X]
        X = np.array([self.apply_preprocessing(x) for x in X])
        # Load the masks
        mask = np.array([cv2.resize(cv2.imread(fi), self.input_shape) for fi in mask])
        return X, mask

    def apply_preprocessing(self, x: np.array):
        """Dummy function that puts `x` between 0 and 1."""
        return x / 255.
With this simple Sequence object, you now get data augmentation and multiprocess loading. Now we need to instantiate those Sequences and apply the data augmentation to the training Sequence. Using Keras-Transform, we will be able to apply random transformations to both the input and the mask.
def get_sequences(paths, batch_size=1):
    """Will create the sequences from `paths`."""
    # We split `paths` in 3 (60% train, 20% val, 20% test) and then create 3 sequences.
    train, val, test = np.split(paths, [int(0.6 * len(paths)), int(0.8 * len(paths))])
    return (YourDatasetSequence(train, INPUT_SHAPE, batch_size),
            YourDatasetSequence(val, INPUT_SHAPE, batch_size),
            YourDatasetSequence(test, INPUT_SHAPE, batch_size))
# We create our three Sequences
paths = get_paths() # [(str,str)]
train_seq, val_seq, test_seq = get_sequences(paths, BATCH_SIZE)
# We create a sequence of transformer to do DA
sequence = SequentialTransformer([RandomZoomTransformer(zoom_range=(0.8, 1.2)),
RandomHorizontalFlipTransformer()])
# By supplying a mask, we can perform transformation on the groundtruth as well.
train_seq = sequence(train_seq, mask=[True, True])
Now that we have everything in place, we only need to instantiate the model and train it!
# Create the model
model = create_model() # Model
# We can now train and evaluate!
model.fit_generator(train_seq, len(train_seq), validation_data=val_seq, validation_steps=len(val_seq),
use_multiprocessing=True, workers=6, epochs=15)
model.evaluate_generator(test_seq, len(test_seq), use_multiprocessing=True, workers=6)
In this post, we've created a pipeline for segmentation using Keras and Keras-Transform. With Sequences, we can safely train our model using multiprocessing. If you have any questions or comments, feel free to e-mail me!
While developing `Sequence` for Keras, I stumbled upon an issue when using `multiprocessing.Pool`.
When you use a read-only structure like a `Sequence`, you expect it to be really fast. But I was getting the opposite: Sequences were 2-5 times slower than generators. In this post, I'll show what the problem is and how to resolve it.
So let’s say you want to share a structure with some internal data like a Sequence.
class Sequence():
def __init__(self, my_list):
self.my_list = my_list
def __getitem__(self, item):
return self.my_list[item]
def __len__(self):
return len(self.my_list)
def get_item(seq, idx):
# Allows Python2 to pickle the Sequence
return seq[idx]
The list could be quite large, thousands of elements. To make it harder for Python to translate the list into a compact C array, we will fill it with tuples of mixed types before wrapping it in a `Sequence`.
files = ['test_string'] * 100000
# Make the test faster.
nb_files = min(len(list(files)), 100)
huge_list = ((x, [1, 3, 4]) for x in files)
seq = Sequence(list(huge_list))
Next, we have to initialize a `multiprocessing.Pool` and a consumer-producer system.
We will create a `Pool` of 5 workers to extract data and a thread to enqueue the promises.
import multiprocessing
import threading
import time
from queue import Queue

x = multiprocessing.Pool(5)
qu = Queue(10)

def run(qu):
    for i in range(nb_files):
        # Simply dequeue an item and wait for its result.
        acc = qu.get(block=True).get()

th = threading.Thread(target=run, args=(qu,))
th.daemon = True
th.start()
Now that everything is set, we can start doing the test. We’ll see how much time it takes to extract 100 items from the queue.
st = time.time()
for i in range(nb_files):
qu.put(x.apply_async(func=get_item, args=(seq, i,)), block=True)
th.join()
print("Took list", time.time() - st)
# Took list 38.6540949344635
38 seconds to do such a simple task seems abnormal. The issue here is that passing `seq` around is expensive, since every call to `get_item` will copy `seq` into the memory of the worker process.
To resolve this problem, we will share the list between every process of the Pool. To do that, we will use a `Manager`.
A Manager is a little server that answers requests on the objects that it holds.
manager = multiprocessing.Manager()
big_list = ((x, [1, 3, 4]) for x in files)
ls = manager.list(list(big_list))
seq2 = Sequence(ls)
Here, we can directly use manager.list() which will create a shared list. Please note that you may need to create your own Proxy if you need other types to be shared.
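For reference, here is a minimal sketch of exposing a custom read-only object through your own manager; the class names are hypothetical:

from multiprocessing.managers import BaseManager

class MyStore:  # hypothetical read-only container we want to share
    def __init__(self, data):
        self._data = data

    def get(self, idx):
        return self._data[idx]

class StoreManager(BaseManager):
    pass

StoreManager.register('MyStore', MyStore)

store_manager = StoreManager()
store_manager.start()
shared_store = store_manager.MyStore(list(range(1000)))  # proxy living in the manager process
print(shared_store.get(42))  # 42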
We can now do some tests.
st = time.time()
for i in range(nb_files):
qu.put(x.apply_async(func=get_item, args=(seq2, i,)), block=True)
th.join()
print("Took Manager list", time.time() - st)
# Took Manager list 0.111
Now we're talking! More than a 300x speedup for 4 lines of Python.
In this post, we've shown a case where sharing an object between processes is better than just copying it. We've made a 300x improvement in about 4 lines of Python that use `multiprocessing.Manager`.
If you have used Keras extensively, you are probably aware that using `model.fit_generator(..., pickle_safe=True)` may not do what you expect.
In fact, generators cannot be shared across processes in Python. This causes your generator to be copied instead, so your training loop will see the same data over and over again. While this is not a big deal in most cases, it makes `model.predict_generator` useless for `workers > 1`.
To solve this problem, I have been toying around with PyTorch's Dataset to add it into the Keras code base. In this post, I'll share my solution, which I hope will be merged into Keras soon enough.
Let's start with some imports. We'll need a `ProcessPoolExecutor` to submit jobs to the different processes, and a single thread-safe queue is all we need.
import os
import time
from concurrent.futures import ProcessPoolExecutor
from itertools import cycle
from queue import Queue
from threading import Thread
As explained earlier, I used PyTorch's Dataset object.
class Dataset():
def __getitem__(self, index):
raise NotImplementedError
def __len__(self):
raise NotImplementedError
For our tests, we shall create a fake Dataset. Here, the `time.sleep(1)` represents all the preprocessing tasks: reading a file, data augmentation, resizing. For the sake of this post, the dataset is an echo dataset: it returns what you give it.
class ExampleDataset(Dataset):
def __getitem__(self, index):
time.sleep(1)
return os.getpid(), index
def __len__(self):
return 100
The class `MultiProcessExecutor` was made to replace `GeneratorEnqueuer` from Keras. While the constructor lacks many arguments, this is just to showcase the power of this object. The `MultiProcessExecutor` will create a `ProcessPoolExecutor`, and when there is room in the queue, it will submit a task to the executor with executor.submit.
Submitting a request returns a `Future` object; you can wait on that object to get the result. The `Future` objects are queued so that they will be read in order.
class MultiProcessExecutor():
def __init__(self, dataset):
self.workers = 5
self.executor = ProcessPoolExecutor(self.workers)
self.futures = {}
self.dataset = dataset
self.queue = Queue(self.workers * 2)
self.run_thread = None
def start(self):
self.run_thread = Thread(target=self.run)
self.run_thread.daemon = True
self.run_thread.start()
def run(self):
""" This will queue up 2*'workers' tasks in order """
indexes = cycle(range(len(self.dataset)))
for i in indexes:
self.queue.put(self.executor.submit(self.dataset.__getitem__, [i]), block=True)
def get_item(self):
while True:
yield self.queue.get(block=True).result()
To test our executor, we will ask for 100 items while timing it. `executor.get_item()` returns a generator that waits on the queue and yields each `Future`'s result.
dataset = ExampleDataset()
executor = MultiProcessExecutor(dataset)
executor.start()
getter = executor.get_item()
start = time.time()
for i in range(100):
result = next(getter)
print("Took executor",time.time()-start)
Took executor 20.045644760131836
Pretty much what we expected: 5 workers doing a 100-second task should reduce the time by 5.
Now, let's compare it to Keras' `GeneratorEnqueuer`. We'll create a generator that does exactly the same thing as our `ExampleDataset`.
from keras.engine.training import GeneratorEnqueuer
def keras_gen():
while True:
time.sleep(1)
yield os.getpid()
qu = GeneratorEnqueuer(keras_gen(),pickle_safe=True)
qu.start(5,10)
start = time.time()
for i in range(100):
while not qu.queue.qsize():
time.sleep(0.5)
result = qu.queue.get()
print("Took Keras",time.time()-start)
Took Keras 20.02438259124756
Pretty much the same as the executor. But the data that `GeneratorEnqueuer` provides won't be in order.
In this post, we've shown how to use an executor. In the future, this work should be integrated with Keras' `model.*_generator` methods.
UPDATE: This feature has been merged into Keras 2.0.5 and is named Sequence.
DISCLAIMER: If you can make any of the tests faster, please send me an e-mail! I'm genuinely interested.
I’ll be using Keras 2.0.0 with TensorFlow 1.0.1.
import operator
import threading
import time
from functools import reduce
import keras
import keras.backend as K
import numpy as np
import tensorflow as tf
from keras.layers import Conv2D
def prod(factors):
return reduce(operator.mul, factors, 1)
sess = tf.Session()
K.set_session(sess)
When using threads/processes, you will probably need the GIL at some point! So here's a decorator that makes any generator thread-safe. I updated the code from anandalogy from Python 2 to Python 3. Thanks to him!
class threadsafe_iter:
"""Takes an iterator/generator and makes it thread-safe by
serializing call to the `next` method of given iterator/generator.
"""
def __init__(self, it):
self.it = it
self.lock = threading.Lock()
def __iter__(self):
return self
def __next__(self):
return self.next()
def next(self):
with self.lock:
return next(self.it)
def threadsafe_generator(f):
"""A decorator that takes a generator function and makes it thread-safe.
"""
def g(*a, **kw):
return threadsafe_iter(f(*a, **kw))
return g
Let's create our problem now! Let's say we have images with shape [200,200,3] and ground truth images with shape [12,12,80]. Feel free to modify any of the constants below.
For each test, we will start `workers = 10` threads/processes. In every case, the queue will have a size of `queue_size = 20`. To be able to run on pretty much any GPU, the batch size will be 10. `TRAINING = True` is a constant to provide to Keras when calling sess.run; this allows models using BatchNorm (like `keras.applications.ResNet50`) to run.
TRAINING = True
batch_size = 10
input_batch = [batch_size, 200, 200, 3]
output_batch = [batch_size, 12, 12, 80]
queue_size = 20
workers = 10
The main advantage of parallelism is doing multiple things at once. To show the great advantage of multiprocessing, the function that creates our input will take more than 1 second to run. In your pipeline, this function would fetch data from disk, call a database, stream from an HDF5 file, etc.
def get_input():
# Super long processing I/O
time.sleep(1)
return np.arange(prod(input_batch)).reshape(input_batch).astype(np.float32), np.arange(
prod(output_batch)).reshape(output_batch).astype(
np.float32)
For our experiment, using TensorFlow's FIFOQueue will make everything simpler and is the first step to speeding up our pipeline. More information on queues here.
inp = K.placeholder(input_batch)
inp1 = K.placeholder(output_batch)
queue = tf.FIFOQueue(queue_size, [tf.float32, tf.float32], [input_batch, output_batch])
x1, y1 = queue.dequeue() # Tensors for the input and the ground truth.
enqueue = queue.enqueue([inp, inp1])
We then need a model. We'll use VGG16 with an MAE loss trained with the RMSProp optimizer. I set the output of the network to be the same as our output shape `[12,12,80]`. The model is not important for this blog post, so feel free to change it.
model = keras.applications.VGG16(False, "imagenet", x1, input_batch[1:])
for i in range(3):
model.layers.pop()
model.layers[-1].outbound_nodes = []
model.outputs = [model.layers[-1].output]
output = model.outputs[0] # 12x12
output3 = Conv2D(5 * (4 + 11 + 1), (1, 1), padding="same", activation='relu')(
output)
cost = tf.reduce_sum(tf.abs(output3 - y1))
optimizer = tf.train.RMSPropOptimizer(0.001).minimize(cost)
sess.run(tf.global_variables_initializer())
To use the FIFOQueue, we need a function to be used by every thread. It's a simple function that will call our `enqueue_op` with data from our `get_input()` function.
def enqueue_data(coord, enqueue_op):
while not coord.should_stop():
inp_feed, inp1_feed = get_input()
sess.run(enqueue_op, feed_dict={inp: inp_feed, inp1: inp1_feed})
Let's see what happens when we do not use parallelism. In this example, we bypass the queue by directly feeding the Tensors `x1, y1`. We will process 30 batches for 10 epochs.
print("No threading")
start = time.time()
for i in (range(10)): # EPOCH
for j in range(30): # Batch
x, y = get_input()
optimizer_, s = sess.run([optimizer, queue.size()], feed_dict={x1: x, y1: y, K.learning_phase(): int(
TRAINING)})
print("Took : ", time.time() - start)
No threading
Took : 369.2351338863373
As expected, this takes more than 300 seconds since our `get_input()` function is doing a 1-second sleep. Let's use the FIFOQueue now.
We need to start threads to do the enqueuing. I'm not aware of any way to do this with processes, but any contribution is welcome! Using the queue saves us one copy from Python to C++ caused by the feed_dict; since the queue already lives in C++, we do not need that copy. I added some code to empty the queue at the end to let the threads die.
coordinator = tf.train.Coordinator()
threads = [threading.Thread(target=enqueue_data, args=(coordinator, enqueue)) for i in
range(workers)]
for t in threads:
t.start()
print("Queue")
start = time.time()
for i in (range(10)): # EPOCH
for j in range(30): # Batch
optimizer_, s = sess.run([optimizer, queue.size()],
feed_dict={K.learning_phase(): int(TRAINING)})
print("Took : ", time.time() - start)
def clear_queue(queue, threads):
while any([t.is_alive() for t in threads]):
_, s = sess.run([queue.dequeue(), queue.size()])
coordinator.request_stop()
clear_queue(queue, threads)
coordinator.join(threads)
print("DONE Queue")
Queue
Took : 61.429038286209106
DONE Queue
Great improvement! More than a 5x speedup! Let's compare it to the GeneratorEnqueuer, which uses processes to feed a queue but still relies on the feed_dict. The main drawback of this approach is that we need the GIL so that our generator does not get iterated over by multiple threads at the same time. Also, we need to sleep while the queue is empty.
from keras.engine.training import GeneratorEnqueuer
@threadsafe_generator
def get_generator():
while True:
yield get_input()
gen = get_generator()
enqueuer = GeneratorEnqueuer(gen, True)
enqueuer.start(max_q_size=queue_size, workers=workers)
time.sleep(1)
print("Keras enqueuer using multiprocess")
start = time.time()
for i in range(10): # EPOCH
for j in range(30): # Batch
while not enqueuer.queue.qsize():
time.sleep(0.5)
x, y = enqueuer.queue.get()
optimizer_ = sess.run([optimizer], feed_dict={x1: x, y1: y,
K.learning_phase(): int(
TRAINING)})
print("Took : ", time.time() - start)
enqueuer.stop()
Keras enqueuer using multiprocess
Took : 32.17211365699768
Amazing! We get a 2x speedup from the FIFOQueue and a 10x speedup from the original code!
The GeneratorEnqueuer can also be run with threads (by setting `pickle_safe=False`); let's see how it goes!
enqueuer = GeneratorEnqueuer(gen, False)
enqueuer.start(max_q_size=queue_size, workers=workers)
time.sleep(1)
print("Keras enqueuer using threads")
start = time.time()
for i in (range(10)): # EPOCH
for j in range(30): # Batch
while not enqueuer.queue.qsize():
time.sleep(0.5)
x, y = enqueuer.queue.get()
optimizer_ = sess.run([optimizer], feed_dict={x1: x, y1: y,
K.learning_phase(): int(
TRAINING)})
print("Took : ", time.time() - start)
enqueuer.stop()
Keras enqueuer using threads
Took : 301.63300943374634
So it’s better than nothing, but the GIL is really killing everything.
| Method | Time (s) |
|---|---|
| No threading | 369 |
| FIFOQueue | 61 |
| Keras multiprocessed | 32 |
| Keras threaded | 301 |
So we saw how to speed up an input pipeline. While FIFOQueues are a big speedup, multiprocessing can speed up the pipeline even more! TFRecord should soon be supported by Keras, and I'll update this post when that's the case. Also, there should be a way to feed a FIFOQueue using multiprocessing.