Introduction
Flux (by Black Forest Labs) has been gaining popularity and taking some share from Midjourney and the mainline Stable Diffusion models in recent months, proving that the generative AI image space is still evolving. In this article I will write up my experience of finetuning Flux with a small image dataset (about 15 .png files). The goal is to teach the model a specific object so that we can “insert” it into future image generations.
Sections
- Set up an account on Replicate.com
- Use the Flux LoRA trainer template
- Collect dataset
- Start training
- Test Results
- Final Thoughts
1: Replicate.com
There are numerous online services that do the same thing. I have tried quite a few (e.g., ThinkDiffusion / RunDiffusion, which are specific to image models, as well as plain virtual machines like Lightning.ai and Runpod), but for this article I will use Replicate. Its UI is simple and no-frills, whereas many other UIs / services make things more complicated than necessary (to the point where it might be easier to just write PyTorch code). Set up an account and fund it with maybe $10. The training steps below should only cost around $2.
2: Use Flux LoRA Trainer Template
We will use this template: https://replicate.com/ostris/flux-dev-lora-trainer/train — it has a good README and a blog post with more details. Essentially, it lets us supply the training dataset, input our desired parameters, and then run the training (or finetuning, to be precise, since we aren’t changing the weights of Flux itself but adding Low-Rank Adaptation (LoRA) weights as an add-on). It will output a model for us that basically includes the base Flux model plus these additional LoRA weights. Here’s a screenshot:
(Under the hood, this template is really building a container (cog) for us that runs the training process on a virtual machine set up with an Nvidia H100 GPU, with behind-the-scenes code in PyTorch and other scripts / utilities.)
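To make the “add-on” idea concrete, here is a minimal, illustrative sketch of what a LoRA update to a single linear layer looks like in plain PyTorch. This is not the trainer’s actual code: the dimensions, rank, and alpha values are made up, and the real training applies this trick to many layers inside Flux.

import torch

# One linear layer's frozen base weight (standing in for a weight inside Flux)
d_out, d_in, rank, alpha = 512, 512, 16, 16
W = torch.randn(d_out, d_in)                            # base weight, stays frozen

# LoRA adds two small trainable matrices: B (d_out x rank) and A (rank x d_in)
A = (torch.randn(rank, d_in) * 0.01).requires_grad_()   # small random init
B = torch.zeros(d_out, rank, requires_grad=True)        # zero init, so the update starts at 0

def forward(x):
    # Effective weight = frozen base + low-rank update, scaled by alpha / rank
    delta = (B @ A) * (alpha / rank)
    return x @ (W + delta).T

x = torch.randn(1, d_in)
y = forward(x)   # during finetuning, only A and B receive gradient updates

Because A and B are tiny compared to W, the “model” the trainer outputs is really just the base Flux weights plus these small extra matrices.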
3: Collect Dataset
Perhaps the most important step is gathering the dataset. In this example, I want the ability to insert a specific object into future images I make with Flux. So I downloaded about 15 images of an old product I have, the Jawbone Jambox. I tried to get images of the same model but with different compositions / angles, and kept things simple by sticking to the red version. Here are some of them. Note that I save the files as jambox1, jambox2, etc. in a folder, say {training_data_jambox}, on my Mac.
When you grab images from Google Images, try to get higher resolution (ideally 1024x1024 or higher) — you can click the little [Tools] button in Google and pick [Large] to filter the images. And pick the color [Red] to save more time.
Optionally, you can add a simple caption file for each image, for example “front view of Jambox,” “Jambox on a table,” etc. To do so, put the caption of each image in a corresponding jambox1.txt, jambox2.txt file, and so on. Put these .txt files in the same {training_data_jambox} folder.
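If you don’t feel like creating the .txt files by hand, a few lines of Python will do it. This is just a convenience sketch; the folder name and the captions dictionary below are hypothetical, so fill in your own paths and text.

import os

folder = "./training_data_jambox"   # adjust to your folder name

# Hypothetical captions: write whatever describes each image
captions = {
    "jambox1.png": "front view of a red Jambox speaker",
    "jambox2.png": "red Jambox on a table",
}

for image_name, caption in captions.items():
    txt_name = os.path.splitext(image_name)[0] + ".txt"
    with open(os.path.join(folder, txt_name), "w") as f:
        f.write(caption)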
Many of the images on the Internet come as .JPG, .JPEG, .WEBP, etc. I prefer converting them all to the same format, specifically .PNG, for consistency (although the trainer doesn’t seem to care). You can do it manually or write a simple script to convert them. Here’s the one I used below. To run it, open a terminal / shell and run > python convert_png.py. It should take every one of your files and create a .png version of it.
convert_png.py
from PIL import Image
import os


def convert_to_png(input_folder, output_folder=None):
    """
    Convert all JPG, JPEG, and WebP images in the input folder to PNG format.

    Args:
        input_folder (str): Path to the folder containing images
        output_folder (str, optional): Path to save converted images. If None,
            saves in the same folder as input
    """
    # If no output folder specified, use input folder
    if output_folder is None:
        output_folder = input_folder

    # Create output folder if it doesn't exist
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    # Supported input formats
    supported_formats = ('.jpg', '.jpeg', '.webp')

    # Process each file in the input folder
    for filename in os.listdir(input_folder):
        if filename.lower().endswith(supported_formats):
            # Open the image
            input_path = os.path.join(input_folder, filename)
            try:
                with Image.open(input_path) as img:
                    # Create output filename
                    output_filename = os.path.splitext(filename)[0] + '.png'
                    output_path = os.path.join(output_folder, output_filename)
                    # Convert and save as PNG
                    img.save(output_path, 'PNG')
                    print(f'Converted: {filename} -> {output_filename}')
            except Exception as e:
                print(f'Error converting {filename}: {str(e)}')


if __name__ == "__main__":
    # Example usage
    folder_path = "./images"  # Replace with your folder path
    convert_to_png(folder_path)
Finally, compress that folder into a .zip file, so you have {your_folder_name}.zip ready to go for the next step.
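You can zip the folder from Finder (right-click, Compress) or with a couple of lines of Python. The sketch below assumes the folder name used above; shutil.make_archive puts the folder’s contents at the top level of the zip.

import shutil

# Creates training_data_jambox.zip in the current directory
shutil.make_archive("training_data_jambox", "zip", "training_data_jambox")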
4: Training
Now, back to the trainer page https://replicate.com/ostris/flux-dev-lora-trainer/train — here are the fields we need to fill in.
[Destination] — this will be a new model under your account, so choose “Create new model.” Here’s my setting:
[input_images] — drop your .zip file here.
[trigger_word] — this is important: once you have successfully trained this LoRA, how do you invoke it? In other words, when you want the Jambox to show up, how do you tell the newly trained model? The [trigger_word] is that callout, the signal that makes sure the new model understands your intent. The default is “TOK,” which is fine, but I used JAMBOX.
[autocaption] — check this if you don’t have the .txt caption files. The trainer will call a separate image-to-text model to scan each image and return a caption. If the end results aren’t good, you can go back and add jambox1.txt, etc. files to provide your own caption for each image.
Everything else we can leave empty / at the defaults for now, without going into the weeds.
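As an aside, if you prefer kicking off the training from code instead of the web form, Replicate’s Python client can create a training job. The outline below is only a sketch under a few assumptions: you have installed the replicate package, set the REPLICATE_API_TOKEN environment variable, created the destination model in your account, and copied the real trainer version hash from the trainer page (the placeholder below is not real). A hosted URL for the zip also works in place of the open file handle.

import replicate

training = replicate.trainings.create(
    # Trainer version string copied from the trainer page (placeholder here)
    version="ostris/flux-dev-lora-trainer:<version-hash>",
    input={
        "input_images": open("training_data_jambox.zip", "rb"),  # the .zip from step 3
        "trigger_word": "JAMBOX",
        "autocaption": True,   # skip if you supplied your own .txt captions
    },
    # Destination model you created under your own account
    destination="your-username/flux-dev-lora-jambox",
)
print(training.status)  # poll this, or just watch progress in the web UI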
Once the training kicks off, below is what you will see in the log window. First, it does the final prep of the dataset. Notice how it auto-captions each image: it mostly recognizes a “speaker,” but some of the captions pick up the “Jawbone” logo:
It will then start the actual training process:
It will take 10-20 minutes to run through 1,000 steps of the neural network / backpropagation training process. Good time to have a coffee break…
When it’s done, you should see a Succeeded flag. In my case, this took 18m 30s to run.
5: Test Results
Finally, we can test it and see if it works. Click on the [Run trained model] button above, and you should see the following page. (If you want to play with mine first, here’s the model: https://replicate.com/chakify/flux-dev-lora-jambox)
Recall that we set the trigger_word parameter to “JAMBOX” — it has to be included in the prompt for the model to apply your LoRA and try to add the Jambox. If not, you are essentially making an image with the base Flux model.
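If you would rather test from code than from the web page, the trained model can also be run with the Replicate Python client. Again, this is a sketch under the same assumptions as before (replicate installed, REPLICATE_API_TOKEN set, and a real version hash copied from the model page); the exact output type may vary with the client version, but it is typically a list of generated images.

import replicate

output = replicate.run(
    # Model page: https://replicate.com/chakify/flux-dev-lora-jambox
    "chakify/flux-dev-lora-jambox:<version-hash>",
    input={"prompt": "Red JAMBOX on the coffee table"},
)

# Typically a list of image URLs / file objects, one per generated image
for item in output:
    print(item)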
Here are some results I got with super simple prompts:
Prompt: “JAMBOX on the coffee table”
Prompt: “Red JAMBOX on the coffee table”
Prompt: “Green JAMBOX on the coffee table”
Given my dataset only has RED Jambox images, why do I say “Red JAMBOX” in my prompt, you ask? It turns out that when I prompted with just JAMBOX, it gave me something in a different color — that’s part of the surprise element of Gen AI models (for better or for worse). For example, here it made a black one (or two!):
I won’t show all the examples I tried. But there are numerous funny ones like this one: Prompt: “Family at the beach. Red JAMBOX in the middle of the beach towel.“
This one has the wrong dimensions but shows promise for “product placement” use cases:
Prompt: “Red Jambox in the palm of a teenager.”
Prompt: “Red JAMBOX in the middle of the bookselves, in an office with black bookshelves.”
Flux is famous for its ability to render text within images. Here’s a test that kinda works…
Prompt: “Red JAMBOX on a table. The word "Flux" is written on it.”
Overall, the results are still hit and miss, especially when it comes to the shape of the Jambox. So it’s time to play with some of the model parameters (before we have to go gather more training data and try again).