How to Make an AI Image Generator

How I built an AI Text-to-Art Generator

A detailed, step-by-step write-up on how I built Text2Art

Text2Art Gallery of Generated Arts [Image by Author]

Overview

This article is a write-up on how I built Text2Art in a week. Text2Art is an AI-powered art generator based on VQGAN+CLIP that can generate all kinds of art, such as pixel art, drawings, and paintings, from just text input. The article follows my thought process, from experimenting with VQGAN+CLIP and building a simple UI with Gradio, to switching to FastAPI to serve the models, and finally to using Firebase as a queue system. Feel free to skip to the parts that you are interested in.

You can try it online, and the source code is available on GitHub (feel free to star the repo).

Text2Art Demo (UPDATE: we have 1.5K+ users now)

Outline

  • Introduction
  • How It Works
  • Generating Art with VQGAN+CLIP with Code
  • Building UI with Gradio
  • Serving ML Model with FastAPI
  • Queue System with Firebase


Introduction

Not long ago, generative art and NFTs took the world by storm. This was made possible by OpenAI’s significant progress in text-to-image generation. Earlier this year, OpenAI announced DALL-E, a powerful text-to-image generator that works extremely well. To illustrate how well DALL-E works, these are DALL-E generated images for the text prompt “a professional high quality illustration of a giraffe dragon chimera. a giraffe imitating a dragon. a giraffe made of dragon”.

Images produced by DALL-E when given the text prompt “a professional high quality illustration of a giraffe dragon chimera. a giraffe imitating a dragon. a giraffe made of dragon.” [Image by OpenAI with MIT license]

Unfortunately, DALL-E was not released to the public. Luckily, CLIP, the model behind DALL-E’s magic, was published instead. CLIP (Contrastive Language-Image Pre-training) is a multimodal network that combines text and images. In short, CLIP is able to score how well an image matches a caption, or vice versa. This is extremely useful for steering a generator to produce an image that closely matches the text input. In DALL-E, CLIP is used to rank the generated images and output the image with the highest score (most similar to the text prompt).

Example of CLIP scoring images and captions [Image by Author]

A few months after the announcement of DALL-E, a new transformer image generator called VQGAN (Vector Quantized GAN) was published. Combining VQGAN with CLIP gives results of similar quality to DALL-E. The community has created many amazing artworks since the pre-trained VQGAN model was made public.

Painting of city harbor night view with many ships, Painting of refugee in war. [Image Generated by Author]

I was really amazed by the results and wanted to share this with my friends. But since not many people are willing to dive into code to generate art, I decided to make Text2Art, a website where anyone can simply type a prompt and quickly generate the image they want without seeing any code.

How It Works

So how does VQGAN+CLIP work? In short, the generator generates an image and CLIP measures how well the image matches the text prompt. Then, the generator uses the feedback from CLIP to generate more “accurate” images. This iteration is repeated many times until the CLIP score becomes high enough and the generated image matches the text.

The VQGAN model generates images while CLIP guides the process. This is done throughout many iterations until the generator learns to produce more “accurate” images. [Source: The Illustrated VQGAN by LJ Miranda]
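The feedback loop described above can be sketched with toy stand-ins. This is only an illustration under loud assumptions, not the real VQGAN+CLIP: `toy_generator`, `toy_clip_score`, and `optimize` are hypothetical names, the "image" is a short list of numbers, the "text embedding" is a fixed vector, and the gradient is estimated with finite differences instead of backpropagation.

```python
import math
import random


def cosine_similarity(a, b):
    """Similarity in [-1, 1]; stands in for how CLIP compares image and text."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_generator(latent):
    """Stand-in for VQGAN: turns a latent vector into an 'image'."""
    return [math.tanh(z) for z in latent]


def toy_clip_score(image, text_embedding):
    """Stand-in for CLIP: scores how well the 'image' matches the 'text'."""
    return cosine_similarity(image, text_embedding)


def optimize(text_embedding, iterations=200, lr=0.1, eps=1e-4, seed=0):
    """Nudge the latent so that the score of the generated image goes up."""
    rng = random.Random(seed)
    latent = [rng.uniform(-1, 1) for _ in text_embedding]
    history = []
    for _ in range(iterations):
        base = toy_clip_score(toy_generator(latent), text_embedding)
        history.append(base)
        # Finite-difference estimate of d(score)/d(latent), one entry at a time.
        grad = []
        for i in range(len(latent)):
            bumped = latent.copy()
            bumped[i] += eps
            bumped_score = toy_clip_score(toy_generator(bumped), text_embedding)
            grad.append((bumped_score - base) / eps)
        # Gradient ascent: the scorer's feedback steers the generator input.
        latent = [z + lr * g for z, g in zip(latent, grad)]
    return latent, history


latent, history = optimize([0.5, -0.3, 0.8, 0.1])
print(f"score at start: {history[0]:.3f}, score at end: {history[-1]:.3f}")
```

In the real system the latent is VQGAN's code, the score comes from CLIP's image and text encoders, and the update uses gradients computed by the deep learning framework, but the shape of the loop is the same.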

I won’t discuss the inner workings of VQGAN or CLIP here as they are not the focus of this article. But if you want a deeper explanation of VQGAN, CLIP, or DALL-E, you can refer to these amazing resources that I found.


VQGAN+CLIP is simply an example of what combining an image generator with CLIP is able to do. However, you can replace VQGAN with any kind of generator and it can still work really well depending on the generator. Many variants of X + CLIP have come up such as StyleCLIP (StyleGAN + CLIP), CLIPDraw (uses vector art generator), BigGAN + CLIP, and many more. There is even AudioCLIP which uses audio instead of images.

Image editing with StyleCLIP [Source: StyleCLIP Paper]

Generating Art with VQGAN+CLIP with Code

I’ve been using the code from the clipit repository by dribnet, which reduces generating art with VQGAN+CLIP to a few simple lines of code (UPDATE: clipit has been migrated to pixray).

It is recommended to run this on Google Colab as VQGAN+CLIP requires quite a lot of GPU memory. Here is a Colab notebook that you can follow along with.

First of all, if you are running on Colab, make sure you change the runtime type to use GPU.

Steps to change Colab runtime type to GPU. [Image by Author]

Next, we need to set up the codebase and the dependencies.

from IPython.utils import io

# Capture the (very long) install output so it does not flood the notebook.
with io.capture_output() as captured:
  !git clone
  # !pip install taming-transformers
  !git clone
  !rm -Rf clipit
  !git clone
  !pip install ftfy regex tqdm omegaconf pytorch-lightning
  !pip install kornia
  !pip install imageio-ffmpeg
  !pip install einops
  !pip install torch-optimizer
  !pip install easydict
  !pip install braceexpand
  !pip install git+

  # ClipDraw deps
  !pip install svgwrite
  !pip install svgpathtools
  !pip install cssutils
  !pip install numba
  !pip install torch-tools
  !pip install visdom

  !pip install gradio

  # Build and install diffvg from source
  !git clone
  %cd diffvg
  # !ls
  !git submodule update --init --recursive
  !python setup.py install
  %cd ..

  # Output folders
  !mkdir -p steps
  !mkdir -p models

(NOTE: In Google Colab, a line prefixed with “!” is run as a shell command instead of as Python code.)

Once the libraries are installed, we can import clipit and run a few lines of code to generate art with VQGAN+CLIP. Simply change the text prompt to whatever you want. You can also give clipit options such as the number of iterations, width, height, generator model, whether to generate a video, and many more. You can read the source code for more information on the available options.

Code for generating art with VQGAN+CLIP
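As a minimal sketch of what such a run might look like: the clipit calls below (`reset_settings`, `add_settings`, `apply_settings`, `do_init`, `do_run`) are written as I recall them from the clipit repository, and the option names `prompts`, `quality`, and `aspect` are assumptions to check against the source. `build_settings` and `generate` are hypothetical helpers introduced only for this example.

```python
def build_settings(prompt, quality="normal", aspect="widescreen", **extra):
    """Collect clipit options as keyword arguments for add_settings."""
    settings = {"prompts": prompt, "quality": quality, "aspect": aspect}
    settings.update(extra)  # e.g. iterations=500 or size=(800, 200)
    return settings


def generate(settings):
    """Run one generation; needs the setup cell above and a GPU runtime."""
    import clipit  # imported lazily so build_settings works without clipit

    clipit.reset_settings()          # start from clipit's defaults
    clipit.add_settings(**settings)  # override them with our options
    applied = clipit.apply_settings()
    clipit.do_init(applied)
    clipit.do_run(applied)


settings = build_settings("underwater city")
# generate(settings)  # uncomment in Colab once the dependencies are installed
```

Keeping the options in a plain dictionary makes it easy to try different prompts and settings without touching the clipit calls themselves.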

Once you run the code, it will generate an image. For each iteration, the generated image will be closer to the text prompt.

Result improvement based on longer iterations for “underwater city”. [Image by Author]

Longer Iterations

If you want to generate for more iterations, simply set the iterations option to the number you want, for example 500.
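A minimal sketch of setting that option, assuming the setup cell has been run and that clipit takes it through `add_settings` in the same way as the size option shown later in this article. The wrapper function is hypothetical and only keeps the snippet importable without clipit.

```python
def set_iterations(iterations=500):
    """Ask clipit for a longer run; call this before clipit.apply_settings()."""
    import clipit  # available only after the setup cell has been run

    clipit.add_settings(iterations=iterations)
```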


Generating Video

Since we need to generate an image for each iteration anyway, we can save these images and create an animation of how the AI generates the image. To do this, simply add the make_video=True option before applying the settings.
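The ordering matters here, so this sketch shows the whole sequence. It assumes the clipit function names as I recall them from the repository and a `prompts` option; `generate_with_video` is a hypothetical wrapper.

```python
def generate_with_video(prompt="underwater city"):
    """Generate an image and also record the intermediate steps as a video."""
    import clipit  # available only after the setup cell has been run

    clipit.reset_settings()
    # make_video must be added before apply_settings() for it to take effect.
    clipit.add_settings(prompts=prompt, make_video=True)
    settings = clipit.apply_settings()
    clipit.do_init(settings)
    clipit.do_run(settings)
```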


It will generate the following video.

Generated “Underwater City” GIF [Image by Author]

Customizing Image Size

You can also change the image size by adding the size=(width, height) option. For example, the following generates a banner image at 800×200 resolution. Note that a higher resolution requires more GPU memory.

clipit.add_settings(size=(800, 200))

Generated 800×200 image with the prompt “Fantasy Kingdom #artstation” [Image by Author]

Generating Pixel Arts

There is also an option to generate pixel art in clipit. It uses the CLIPDraw renderer behind the scenes, with some engineering to force a pixel art style, such as limiting the color palette and pixelization. To use it, simply enable the use_pixeldraw=True option.
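A small sketch of the options for a pixel-art run. `use_pixeldraw` is the option named above; `prompts` is assumed to be the prompt option, and `pixel_art_options` is a hypothetical helper.

```python
def pixel_art_options(prompt):
    """Options for a pixel-art run; pass them as clipit.add_settings(**options)."""
    return {"prompts": prompt, "use_pixeldraw": True}


options = pixel_art_options("Knight in armor #pixelart")
```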


Generated image with the prompt “Knight in armor #pixelart” (left) and “A world of chinese fantasy video game #pixelart” (right) [Image by Author]

VQGAN+CLIP Keywords Modifier

Due to the bias in CLIP, adding certain keywords to the prompt can have a noticeable effect on the generated image. For example, adding “unreal engine” to the text prompt tends to produce a realistic or HD style. Adding certain site names such as “deviantart”, “artstation”, or “flickr” usually makes the results more aesthetic. My favorite is the “artstation” keyword, as I find it generates the best art.

Keywords comparison [Image by kingdomakrillic]

You can also use keywords to condition the art style. For example, you can use keywords such as “pencil sketch” or “low poly”, or even an artist’s name such as “Thomas Kinkade” or “James Gurney”.

Artstyle Keyword Comparison. [Image by kingdomakrillic]

To explore the effect of various keywords further, you can check out the full experiment results by kingdomakrillic, which show the results of 200+ keywords applied to the same 4 subjects.

Building UI with Gradio

My first plan for deploying the ML model was to use Gradio. Gradio is a Python library that reduces building an ML demo to a few lines of code. With Gradio, you can build a demo in less than 10 minutes. You can also run Gradio in Colab, and it will generate a shareable link on the Gradio domain. You can instantly share this link with your friends or the public to let them try out your demo. Gradio still has some limitations, but I find it the most suitable library when you just want to demonstrate a single function.

Gradio UI [Image by Author]

So here is the code that I wrote to build a simple UI for the Text2Art app. I think the code is quite self-explanatory, but if you need more explanation, you can read the Gradio documentation.

Code to build the Gradio UI
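A minimal sketch of such a UI, under several assumptions: it uses the current Gradio component classes (`gr.Textbox`, `gr.Checkbox`, `gr.Image`), the clipit function names as I recall them, and an assumed output path of `output.png` that should be checked against where clipit actually writes the image. `build_clipit_options`, `generate`, and `build_ui` are hypothetical names.

```python
def build_clipit_options(prompt, pixel_art):
    """Translate the UI inputs into clipit options."""
    return {"prompts": prompt, "use_pixeldraw": bool(pixel_art)}


def generate(prompt, pixel_art):
    """The single function the demo exposes: text in, image path out."""
    import clipit  # available only after the setup cell has been run

    clipit.reset_settings()
    clipit.add_settings(**build_clipit_options(prompt, pixel_art))
    settings = clipit.apply_settings()
    clipit.do_init(settings)
    clipit.do_run(settings)
    return "output.png"  # assumed output path


def build_ui():
    """Wrap `generate` in a simple Gradio interface."""
    import gradio as gr

    return gr.Interface(
        fn=generate,
        inputs=[
            gr.Textbox(label="Text Prompt", value="Underwater city"),
            gr.Checkbox(label="Pixel art", value=False),
        ],
        outputs=gr.Image(label="Generated art"),
        title="Text2Art",
    )


# build_ui().launch(share=True)  # share=True asks Gradio for a public link
```

Gradio maps each input component to one argument of `generate`, in order, which is why the whole demo fits in one function plus one `Interface` call.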

Once you run this in Google Colab or locally, it will generate a shareable link that makes your demo publicly accessible. I find this extremely useful as I don’t need to set up a tunneling tool like Ngrok on my own to share my demo. Gradio also offers a hosting service where you can permanently host your demo for only $7/month.

Shareable link of the Gradio demo. [Image by Author]

However, Gradio only works well for demoing a single function. Creating a custom site with additional features like gallery, login, or even just custom CSS is fairly limited or not possible at all.

One quick solution I could think of was to create my demo site separately from the Gradio UI and then embed the Gradio UI on the site using an iframe element. I initially tried this method but then realized one important drawback: I cannot personalize any parts that need to interact with the ML app itself. For example, things such as input validation or a custom progress bar are not possible with an iframe. This is when I decided to build an API instead.

Serving ML Model with FastAPI

I’ve been using FastAPI instead of Flask to quickly build my API. The main reason is that I find FastAPI faster to write (less code), and it also auto-generates documentation (using Swagger UI) that lets me test the API with a basic UI. Additionally, FastAPI supports asynchronous functions and is said to be faster than Flask.

Accessing Swagger UI by adding /docs/ in the URL [Image by Author]

Testing API in Swagger UI [Image by Author]

Here is the code I wrote to serve my ML function as a FastAPI server.

Code for API server
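A minimal sketch of such a server, assuming a `generate_image(prompt, quality)` function that runs the model and returns the path of the generated PNG. The `/generate` route, its parameters, and `create_app` are hypothetical choices for this example.

```python
def create_app(generate_image):
    """Build a FastAPI app around `generate_image(prompt, quality) -> path`."""
    from fastapi import FastAPI
    from fastapi.responses import FileResponse

    app = FastAPI(title="Text2Art API")

    @app.get("/generate")
    def generate(prompt: str, quality: str = "draft"):
        # Blocks until generation finishes, then returns the image file.
        image_path = generate_image(prompt, quality)
        return FileResponse(image_path, media_type="image/png")

    return app
```

Because `prompt` and `quality` are declared as typed function parameters, FastAPI reads them from the query string and lists them in the auto-generated Swagger UI.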

Once we have defined the server, we can run it using uvicorn. Because Google Colab only allows access to its server through the Colab interface, we also have to use Ngrok to expose the FastAPI server to the public.

Code to run and expose the server
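A minimal sketch of running and exposing the server from a notebook, assuming the `pyngrok` and `nest_asyncio` packages are installed and that an ngrok auth token has been configured if your ngrok version requires one. `serve_with_ngrok` is a hypothetical helper name.

```python
def serve_with_ngrok(app, port=8000):
    """Open a public ngrok tunnel to `port` and serve `app` with uvicorn."""
    import nest_asyncio
    import uvicorn
    from pyngrok import ngrok

    tunnel = ngrok.connect(port)  # public URL that forwards to the local port
    print("Public URL:", tunnel.public_url)

    # A notebook already runs an event loop; this lets uvicorn start inside it.
    nest_asyncio.apply()
    uvicorn.run(app, port=port)
```

The call to `uvicorn.run` blocks the cell for as long as the server is running, so the public URL is printed before it starts.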

Once the server is running, we can head to the Swagger UI (by adding /docs to the generated Ngrok URL) and test out the API.

Generating “Underwater Castle” using FastAPI Swagger UI [Image by Author]

While testing the API, I realized that inference can take about 3–20 minutes depending on the quality/iterations. Even 3 minutes is already very long for an HTTP request, and users may not want to wait that long on the site. Given the long inference time, I decided that running the inference as a background task and emailing the user once the result is ready would be more suitable.

Now that we have decided on the plan, we will first write the function to send the email. I initially used the SendGrid email API for this, but after running out of the free usage quota (100 emails/day), I switched to the Mailgun API since it is part of the GitHub Student Developer Pack and allows 20,000 emails/month for students.

So here is the code to send an email with an image attachment using Mailgun API.

Code for sending an email with Mailgun API
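A minimal sketch of sending an email with an attachment through Mailgun's HTTP messages endpoint, using the `requests` library. The domain, API key, sender address, and helper names are placeholders and assumptions to replace with the values from your own Mailgun account.

```python
import os


def build_mailgun_request(domain, sender, recipient, subject, text):
    """Return the endpoint URL and form fields for a Mailgun message."""
    url = f"https://api.mailgun.net/v3/{domain}/messages"
    data = {"from": sender, "to": [recipient], "subject": subject, "text": text}
    return url, data


def send_email(api_key, domain, sender, recipient, subject, text, image_path):
    """Send the generated image to the user as an email attachment."""
    import requests

    url, data = build_mailgun_request(domain, sender, recipient, subject, text)
    with open(image_path, "rb") as image_file:
        response = requests.post(
            url,
            auth=("api", api_key),  # Mailgun uses HTTP basic auth with user "api"
            files=[("attachment", (os.path.basename(image_path), image_file))],
            data=data,
            timeout=30,
        )
    response.raise_for_status()  # fail loudly if Mailgun rejects the request
    return response
```

Splitting the request construction from the network call keeps the part that can be checked offline separate from the part that needs real credentials.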

Next, we will modify our server code to use background tasks in FastAPI and send the result through email in the background.
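A minimal sketch of that change using FastAPI's `BackgroundTasks`. It assumes two functions supplied by the caller, `generate_image(prompt)` returning an image path and `send_result(email, image_path)` sending the email; these names, the `/generate` route, and `create_background_app` are hypothetical.

```python
def create_background_app(generate_image, send_result):
    """Build an app that replies immediately and does the work afterwards."""
    from fastapi import BackgroundTasks, FastAPI

    app = FastAPI()

    def generate_and_email(prompt, email):
        # Runs after the HTTP response has already been sent.
        image_path = generate_image(prompt)
        send_result(email, image_path)

    @app.get("/generate")
    def generate(prompt: str, email: str, background_tasks: BackgroundTasks):
        background_tasks.add_task(generate_and_email, prompt, email)
        return {"message": "Task is processed in the background"}

    return app
```

FastAPI injects the `BackgroundTasks` object based on the type annotation, and tasks added to it run once the response has been returned.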

With this change, the server replies to the request right away with the “Task is processed in the background” message instead of waiting for the generation process to finish and replying with the image.

Once the process is finished, the server will send the result by emailing the user.

Image and video results emailed to the user. [Image by author]

Queue System with Firebase

Now that everything seemed to be working, I built the front end and shared the site with my friends. However, I found that there was a concurrency problem when testing it out with multiple users.

When a second user made a request to the server while the first task was still processing, the second task would somehow terminate the current process instead of running in parallel or being queued. I was not sure what caused this; it may have been the use of global variables in the clipit code, or something else. I did not spend much time debugging it, as I realized that I needed to implement a message queue system instead.

After a few Google searches on message queue systems, I found that most recommendations were RabbitMQ or Redis. However, I was not sure whether RabbitMQ or Redis could be installed on Google Colab, as they seem to require sudo permission. In the end, I decided to use Google Firebase as a queue system instead, as I wanted to finish the project ASAP and Firebase is the one I’m most familiar with.

Basically, when the user tries to generate art in the frontend, it adds an entry describing the task (prompt, image type, size, etc.) to a collection named queue. Meanwhile, we run a script on Google Colab that continuously listens for new entries in the queue collection and processes the tasks one by one.

Backend code that processes the task and listens to the queue continuously
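A minimal sketch of such a worker using the `firebase_admin` package. Note that it polls the collection in a loop rather than attaching a real-time listener, which is a simpler swapped-in technique for illustration. The key file name `serviceAccountKey.json`, the `handle_task` callback, and the helper names are assumptions.

```python
import time


def process_next(queue_ref, handle_task):
    """Process the oldest task in the queue; return True if one was handled."""
    docs = list(queue_ref.order_by("created_at").limit(1).stream())
    if not docs:
        return False
    doc = docs[0]
    handle_task(doc.to_dict())  # e.g. generate the art and email the result
    doc.reference.delete()      # remove the task so it is not processed twice
    return True


def run_worker(handle_task, poll_seconds=5):
    """Keep draining the queue collection, one task at a time."""
    import firebase_admin
    from firebase_admin import credentials, firestore

    # Assumed path to the service account key downloaded from Firebase.
    firebase_admin.initialize_app(credentials.Certificate("serviceAccountKey.json"))
    queue_ref = firestore.client().collection("queue")

    while True:
        if not process_next(queue_ref, handle_task):
            time.sleep(poll_seconds)  # queue is empty; wait before checking again
```

Because a single worker takes tasks strictly one after another, two requests can no longer interfere with each other the way they did with the plain API server.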

In the front end, we only have to add a new task to the queue. Make sure you have set up Firebase properly on your front end first.

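// `db` is assumed to be the initialized Firestore instance, e.g. firebase.firestore()
db.collection("queue").add({
  prompt: prompt,
  email: email,
  quality: quality,
  type: type,
  aspect: aspect,
  created_at: firebase.firestore.FieldValue.serverTimestamp(),
});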

And it’s done! Now, when a user tries to generate art in the frontend, it will add a new task in the queue. The worker script in the Colab server will then process the tasks in the queue one by one.

You can check out the GitHub repo to see the full code (feel free to star the repo).

Adding new task in queue on frontend [Image by Author]

Queue content in the Firebase [Image by Author]

Written by Jane