[Stable Diffusion] Generating an Image from Multiple Text Prompts with AI

This article shows how to generate a single image from multiple text prompts using a machine learning technique called Mixture of Diffusers.

Mixture of diffusers

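Overview

Mixture of Diffusers is an image generation tool that runs multiple diffusion processes, each acting on a different region of the canvas, and combines them into one unified image.

The image above was generated from four text prompts.

In this article, we use this technique to generate a single image from multiple text prompts.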
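To make the idea of combining several processes on one canvas more concrete, here is a minimal one-dimensional sketch. It is not the library's implementation: it assumes each tile contributes a value for every column it covers and that overlapping columns are averaged with equal weight, whereas the actual library may weight tiles differently. The function name `blend_tiles` is hypothetical.

```python
def blend_tiles(width, tiles):
    """Average per-tile values over a 1D canvas.

    tiles: list of (start, end, values), end exclusive,
    with len(values) == end - start.
    """
    total = [0.0] * width
    count = [0] * width
    for start, end, values in tiles:
        for offset, value in enumerate(values):
            total[start + offset] += value
            count[start + offset] += 1
    # Columns covered by no tile are left at 0.0.
    return [t / c if c else 0.0 for t, c in zip(total, count)]


# Two tiles of width 4 on a canvas of width 6, overlapping on columns 2 and 3.
canvas = blend_tiles(6, [(0, 4, [1.0] * 4), (2, 6, [3.0] * 4)])
print(canvas)
```

In the overlapping columns the result is the mean of both tiles, which is what lets neighbouring regions transition smoothly instead of showing a hard seam.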

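Demo (Colaboratory)

Let's generate images by actually running the code.
The source code is included in this article, and it is also available on GitHub below.
GitHub - Colaboratory demo

You can also open it directly in Google Colaboratory from the link below.

This demo is implemented in Python.
If you are not confident with Python, or want to study machine learning with Python in more depth, the following books and online courses are recommended.

Recommended books

Recommended online courses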

Environment setup

Let's set up the environment. After opening Colaboratory, change the following setting so that a GPU is used.

"Change runtime type" → set "Hardware accelerator" to GPU

First, fetch the source code from GitHub.

%cd /content

!git clone https://github.com/albarji/mixture-of-diffusers.git

# Pin to the commit from Feb 12, 2023
%cd /content/mixture-of-diffusers
!git checkout a0a1ac6d71aaa91e0aec73336eeb7ddc4f08fe21

Next, install the libraries.

%cd /content/mixture-of-diffusers

!pip install diffusers[torch]==0.7.* ftfy==6.1.* gitpython==3.1.* ligo-segments==1.4.* transformers==4.21.*
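Because the install command pins each library to a version series, it can help to confirm what actually got installed before going further. The following is a minimal sketch using only the standard library. The helper name `check_pins` is hypothetical, and the wildcard comparison via `fnmatch` only approximates pip's `==X.Y.*` matching.

```python
from fnmatch import fnmatch
from importlib import metadata

# Version pins copied from the pip command above.
PINS = {
    "diffusers": "0.7.*",
    "ftfy": "6.1.*",
    "gitpython": "3.1.*",
    "ligo-segments": "1.4.*",
    "transformers": "4.21.*",
}


def check_pins(pins):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, pattern in pins.items():
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (version, fnmatch(version, pattern))
    return report


for name, (version, ok) in check_pins(PINS).items():
    status = "OK" if ok else "MISMATCH or missing"
    print(f"{name}: {version} ({status})")
```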

Finally, import the libraries.
Set access_token to an access token issued by HuggingFace.

See the following for how to obtain an access token.

%cd /content/mixture-of-diffusers

from diffusers import LMSDiscreteScheduler
from mixdiff.tiling import (
    StableDiffusionTilingPipeline)
from mixdiff.canvas import (
    StableDiffusionCanvasPipeline, Text2ImageRegion)

import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print("using device:", device)

access_token = "Set your HuggingFace access token here"
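If you would rather not paste the token into the notebook, one option is to read it from an environment variable. This is a minimal sketch, not part of the original demo: the variable name `HF_TOKEN` and the helper `load_token` are assumptions made for illustration.

```python
import os


def load_token(env_var="HF_TOKEN"):
    """Read the access token from an environment variable (name is an assumption)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your access token")
    return token


# Usage, once the variable has been set in the runtime:
# access_token = load_token()
```

This keeps the secret out of the saved notebook, which matters if the notebook is shared or committed to a repository.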

That completes the environment setup.

Setting up the pretrained model

Here we download Stable Diffusion from HuggingFace.

model_id = "CompVis/stable-diffusion-v1-4"

# load scheduler
scheduler = LMSDiscreteScheduler(
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    num_train_timesteps=1000)

# load model
pipeline = StableDiffusionTilingPipeline.from_pretrained(
    model_id,
    scheduler=scheduler,
    use_auth_token=access_token
    ).to(device)
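For intuition about the scheduler arguments, the sketch below computes a "scaled_linear" beta schedule in plain Python. It assumes the usual definition, in which the square roots of the betas are spaced linearly between the start and end values and then squared. It is an illustration only and does not replace the scheduler object; the function name `scaled_linear_betas` is hypothetical.

```python
def scaled_linear_betas(beta_start, beta_end, num_train_timesteps):
    """Linearly space sqrt(beta) between the endpoints, then square each value."""
    lo, hi = beta_start ** 0.5, beta_end ** 0.5
    step = (hi - lo) / (num_train_timesteps - 1)
    return [(lo + i * step) ** 2 for i in range(num_train_timesteps)]


# Same values as passed to the scheduler above.
betas = scaled_linear_betas(0.00085, 0.012, 1000)
print(len(betas), betas[0], betas[-1])
```

The schedule starts at beta_start, ends at beta_end, and increases monotonically, so more noise is added at later training timesteps.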

TilingPipeline

Now let's generate an image from multiple prompts.

First, set the prompts.

prompt = [
    "kids in the park, ukiyo-e, intricate, elegant, highly detailed, smooth, sharp focus, artstation, stunning masterpiece, impressive colors",
    "kids in the park, Van Gogh, intricate, elegant, highly detailed, smooth, sharp focus, artstation, stunning masterpiece, impressive colors",
    "kids in the park, modern anime, intricate, elegant, highly detailed, smooth, sharp focus, artstation, stunning masterpiece, impressive colors",
    "kids in the park, Banksy, intricate, elegant, highly detailed, smooth, sharp focus, artstation, stunning masterpiece, impressive colors",
]
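The four prompts differ only in the style keyword, so the same list can also be built programmatically, which makes it easier to swap styles. A small sketch follows; the names `SUBJECT`, `STYLES`, `MODIFIERS`, and `build_prompts` are introduced here for illustration.

```python
SUBJECT = "kids in the park"
STYLES = ["ukiyo-e", "Van Gogh", "modern anime", "Banksy"]
MODIFIERS = (
    "intricate, elegant, highly detailed, smooth, sharp focus, "
    "artstation, stunning masterpiece, impressive colors"
)


def build_prompts(subject, styles, modifiers):
    """Return one prompt per style, sharing the subject and quality modifiers."""
    return [f"{subject}, {style}, {modifiers}" for style in styles]


prompt = build_prompts(SUBJECT, STYLES, MODIFIERS)
for p in prompt:
    print(p)
```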

Next, generate the image according to the prompts.

# num_inference_steps: number of diffusion steps.
# guidance_scale: classifier-free guidance.
# seed: general random seed to initialize latents.
# tile_height: height in pixels of each grid tile.
# tile_width: width in pixels of each grid tile.
# tile_row_overlap: number of overlap pixels between tiles in consecutive rows.
# tile_col_overlap: number of overlap pixels between tiles in consecutive columns.
# guidance_scale_tiles: specific weights for classifier-free guidance in each tile. If None, the value provided in guidance_scale will be used.
# seed_tiles: specific seeds for the initialization latents in each tile. These will override the latents generated for the whole canvas using the standard seed parameter.
# seed_tiles_mode: either "full" or "exclusive". If "full", all the latents affected by the tile will be overridden. If "exclusive", only the latents that are affected exclusively by this tile (and no other tiles) will be overridden.
# seed_reroll_regions: a list of tuples in the form (start row, end row, start column, end column, seed) defining regions in pixel space for which the latents will be overridden using the given seed. Takes priority over seed_tiles.

image = pipeline(
    # one inner list per row of tiles; here a single row with four tile columns
    prompt=[
        prompt
    ],
    tile_height=640,
    tile_width=640,
    tile_row_overlap=0,
    tile_col_overlap=256,
    guidance_scale=8,
    seed=12,
    num_inference_steps=50,
)["sample"][0]

image

The output is shown below.
The images for the four prompts are joined naturally side by side into a single image.
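The size of the final canvas follows from the tile parameters. The sketch below assumes that each additional tile extends the canvas by the tile size minus the overlap with its neighbour; the helper `canvas_size` is hypothetical and is not part of the library's API.

```python
def canvas_size(grid_rows, grid_cols, tile_height, tile_width,
                tile_row_overlap, tile_col_overlap):
    """Return (height, width) in pixels for a grid of overlapping tiles."""
    height = tile_height + (grid_rows - 1) * (tile_height - tile_row_overlap)
    width = tile_width + (grid_cols - 1) * (tile_width - tile_col_overlap)
    return height, width


# One row of four 640x640 tiles with a 256-pixel column overlap, as above.
print(canvas_size(1, 4, 640, 640, 0, 256))
```

Under that assumption, the settings used in this article give a canvas 640 pixels high and 1792 pixels wide. A larger tile_col_overlap yields a narrower image with wider blended transitions between neighbouring prompts.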

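Summary

This article showed how to generate a single image from multiple prompts using Mixture of Diffusers.

The focus here was on getting machine learning code running.
For readers who want to study the topic in a more academic, systematic way, the following books are recommended. Please give them a read.

Link

To go from just running code to understanding and applying it as an engineer, the Udemy courses below are also recommended.

References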

1. GitHub - albarji/mixture-of-diffusers
