Cost-effective AI image generation with PixArt-Sigma inference on AWS Trainium and AWS Inferentia
PixArt-Sigma is a diffusion transformer model capable of generating images at up to 4K resolution. Through data and architectural improvements, it shows significant gains over previous-generation PixArt models such as PixArt-alpha and over other diffusion models. AWS Trainium and AWS Inferentia are purpose-built chips that accelerate machine learning (ML) workloads, making them well suited for cost-effective deployment of large generative models. By using these chips, you can achieve optimal performance and efficiency when running inference on diffusion transformer models such as PixArt-Sigma.
This post is the first in a series in which we will deploy multiple diffusion transformers on Trainium and Inferentia-powered instances. In this post, we show how you can deploy PixArt-Sigma on Trainium and Inferentia-powered instances.
Solution overview
The following steps will be used to deploy the PixArt-Sigma model on AWS Trainium and run inference on it to generate high-quality images:
- Step 1 – Prerequisites and Configuration
- Step 2 – Download and compile the PixArt-Sigma model for AWS Trainium
- Step 3 – Deploy the model on AWS Trainium to generate images
Step 1 – Prerequisites and Configuration
To get started, you will need to set up a development environment on a trn1, trn2, or inf2 host. Complete the following steps:
- Launch a trn1.32xlarge or trn2.48xlarge instance with a Neuron DLAMI. For instructions on how to get started, refer to Get Started with Neuron on Ubuntu 22 with Neuron Multi-Framework DLAMI.
- Launch a Jupyter Notebook server. For instructions on setting up a Jupyter server, refer to the following user guide.
- Clone the aws-neuron-samples GitHub repository:
- Navigate to the hf_pretrained_pixart_sigma_1k_latency_optimized.ipynb notebook:
The provided example is designed to run on a Trn2 instance, but you can adapt it for Trn1 or Inf2 instances with minimal modifications. Specifically, within the notebook and in each of the component files under the neuron_pixart_sigma directory, you will find commented-out changes to accommodate Trn1 or Inf2 configurations.
Step 2 – Download and compile the PixArt-Sigma model for AWS Trainium
This section provides a step-by-step guide to compiling PixArt-Sigma for AWS Trainium.
Download the model
You will find a utility function in cache_hf_model.py in the aforementioned GitHub repository that shows how to download the PixArt-Sigma model from Hugging Face. If you are using PixArt-Sigma in your own workload and decide not to use the script included in this post, you can use the huggingface-cli to download the model locally instead.
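As an alternative sketch (not the repository's script), the same download can be done with the huggingface_hub Python API; the cache directory name below simply mirrors the one used by the pipeline code later in this post:

from huggingface_hub import snapshot_download

# Download the PixArt-Sigma weights into a local cache directory so the
# pipeline can later be created with local_files_only=True.
snapshot_download(
    repo_id="PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    cache_dir="pixart_sigma_hf_cache_dir_1024",
)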
The Neuron PixArt-Sigma implementation consists of several scripts and classes. The different files and scripts are organized as follows:
├── compile_latency_optimized.sh # Full Model Compilation script for Latency Optimized
├── compile_throughput_optimized.sh # Full Model Compilation script for Throughput Optimized
├── hf_pretrained_pixart_sigma_1k_latency_optimized.ipynb # Notebook to run Latency Optimized Pixart-Sigma
├── hf_pretrained_pixart_sigma_1k_throughput_optimized.ipynb # Notebook to run Throughput Optimized Pixart-Sigma
├── neuron_pixart_sigma
│ ├── cache_hf_model.py # Model downloading Script
│ ├── compile_decoder.py # Text Encoder Compilation Script and Wrapper Class
│ ├── compile_text_encoder.py # Text Encoder Compilation Script and Wrapper Class
│ ├── compile_transformer_latency_optimized.py # Latency Optimized Transformer Compilation Script and Wrapper Class
│ ├── compile_transformer_throughput_optimized.py # Throughput Optimized Transformer Compilation Script and Wrapper Class
│ ├── neuron_commons.py # Base Classes and Attention Implementation
│ └── neuron_parallel_utils.py # Sharded Attention Implementation
└── requirements.txt
This notebook will walk you through downloading the model, compiling the individual component models, and invoking the generation pipeline to generate an image. Although the notebook can be run as a standalone sample, the following sections of this post walk through the key implementation details within the component files and scripts to support running PixArt-Sigma on Neuron.
For each PixArt component (T5 text encoder, transformer, and VAE), the example uses Neuron-specific wrapper classes. These wrapper classes serve two purposes. The first is that they allow us to trace the models for compilation:
class InferenceTextEncoderWrapper(nn.Module):
    def __init__(self, dtype, t: T5EncoderModel, seqlen: int):
        super().__init__()
        self.dtype = dtype
        self.device = t.device
        self.t = t

    def forward(self, text_input_ids, attention_mask=None):
        return self.t(text_input_ids, attention_mask)['last_hidden_state'].to(self.dtype)
Please refer to the neuron_commons.py file for all of the wrapper modules and classes.
The second reason for using wrapper classes is to modify the attention implementation to run on Neuron. Because diffusion models like PixArt are typically compute-bound, you can improve performance by sharding the attention layers across multiple devices. To do this, you replace the linear layers with NeuronX Distributed's RowParallelLinear and ColumnParallelLinear layers:
def shard_t5_self_attention(tp_degree: int, selfAttention: T5Attention):
    orig_inner_dim = selfAttention.q.out_features
    dim_head = orig_inner_dim // selfAttention.n_heads
    original_nheads = selfAttention.n_heads
    selfAttention.n_heads = selfAttention.n_heads // tp_degree
    selfAttention.inner_dim = dim_head * selfAttention.n_heads
    orig_q = selfAttention.q
    selfAttention.q = ColumnParallelLinear(
        selfAttention.q.in_features,
        selfAttention.q.out_features,
        bias=False,
        gather_output=False)
    selfAttention.q.weight.data = get_sharded_data(orig_q.weight.data, 0)
    del(orig_q)
    orig_k = selfAttention.k
    selfAttention.k = ColumnParallelLinear(
        selfAttention.k.in_features,
        selfAttention.k.out_features,
        bias=(selfAttention.k.bias is not None),
        gather_output=False)
    selfAttention.k.weight.data = get_sharded_data(orig_k.weight.data, 0)
    del(orig_k)
    orig_v = selfAttention.v
    selfAttention.v = ColumnParallelLinear(
        selfAttention.v.in_features,
        selfAttention.v.out_features,
        bias=(selfAttention.v.bias is not None),
        gather_output=False)
    selfAttention.v.weight.data = get_sharded_data(orig_v.weight.data, 0)
    del(orig_v)
    orig_out = selfAttention.o
    selfAttention.o = RowParallelLinear(
        selfAttention.o.in_features,
        selfAttention.o.out_features,
        bias=(selfAttention.o.bias is not None),
        input_is_parallel=True)
    selfAttention.o.weight.data = get_sharded_data(orig_out.weight.data, 1)
    del(orig_out)
    return selfAttention
Please refer to the neuron_parallel_utils.py file for more details on parallel attention.
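The get_sharded_data helper used above is defined in the repository rather than shown here. A minimal sketch of what such a helper could look like, assuming NeuronX Distributed's parallel_state API for the tensor-parallel rank and size (an illustrative assumption, not the repository's exact implementation):

import torch
from neuronx_distributed.parallel_layers import parallel_state

def get_sharded_data(data: torch.Tensor, dim: int) -> torch.Tensor:
    # Slice out this NeuronCore's shard of a weight tensor along `dim`.
    tp_rank = parallel_state.get_tensor_model_parallel_rank()
    tp_size = parallel_state.get_tensor_model_parallel_size()
    per_partition = data.shape[dim] // tp_size
    return data.narrow(dim, tp_rank * per_partition, per_partition).clone()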
Compile individual sub-models
The PixArt-Sigma model is made up of three components. Each component is compiled separately so that the entire generation pipeline can run on Neuron:
- Text encoder – A 4-billion-parameter encoder that translates a human-readable prompt into an embedding. In the text encoder, the attention layers are sharded, along with the feed-forward layers, using tensor parallelism.
- Denoising transformer model – A 700-million-parameter transformer that iteratively denoises a latent (a numerical representation of a compressed image). In the transformer, the attention layers are sharded, along with the feed-forward layers, using tensor parallelism (see the sketch after this list).
- Decoder – A VAE decoder that converts the latent produced by the denoiser into an output image. For the decoder, the model is deployed with data parallelism.
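Sharding the feed-forward layers follows the same pattern as the shard_t5_self_attention function shown earlier: the up-projections become ColumnParallelLinear layers and the down-projection becomes a RowParallelLinear layer. The sketch below illustrates the idea for a gated T5 feed-forward block; the attribute names follow the Hugging Face T5 implementation, and this is an illustrative sketch rather than the repository's exact helper (it reuses the get_sharded_data helper discussed above):

from neuronx_distributed.parallel_layers.layers import ColumnParallelLinear, RowParallelLinear
from transformers.models.t5.modeling_t5 import T5DenseGatedActDense

def shard_t5_ff(ff: T5DenseGatedActDense) -> T5DenseGatedActDense:
    # Up-projections: split the output dimension across NeuronCores.
    for name in ("wi_0", "wi_1"):
        orig = getattr(ff, name)
        new = ColumnParallelLinear(orig.in_features, orig.out_features,
                                   bias=False, gather_output=False)
        new.weight.data = get_sharded_data(orig.weight.data, 0)
        setattr(ff, name, new)
    # Down-projection: split the input dimension; partial results are summed.
    orig_wo = ff.wo
    ff.wo = RowParallelLinear(orig_wo.in_features, orig_wo.out_features,
                              bias=False, input_is_parallel=True)
    ff.wo.weight.data = get_sharded_data(orig_wo.weight.data, 1)
    return ff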
Now that the model definitions are ready, you need to trace each model so it can run on Trainium or Inferentia. You can see how to use the trace() function to compile the PixArt decoder component model in the following code block:
compiled_decoder = torch_neuronx.trace(
    decoder,
    sample_inputs,
    compiler_workdir=f"{compiler_workdir}/decoder",
    compiler_args=compiler_flags,
    inline_weights_to_neff=False
)
Please refer to the compile_decoder.py file for more details on how to instantiate and compile the decoder.
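For context, sample_inputs is simply an example tensor with the shape and dtype the decoder expects at inference time; tracing records the compiled graph for that shape. A minimal sketch, where the latent shape (1, 4, 128, 128) and bfloat16 dtype are assumptions for a 1024x1024 output with a 4-channel, 8x-downsampled latent space rather than values taken from the repository:

import torch

# Assumed example input for tracing the VAE decoder: one latent with
# 4 channels at 1/8 of the 1024x1024 output resolution.
sample_inputs = torch.rand((1, 4, 128, 128), dtype=torch.bfloat16)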
To run models with tensor parallelism, a technique that splits a tensor into chunks across multiple NeuronCores, you must trace the model with a predetermined tp_degree. This tp_degree specifies the number of NeuronCores to shard the model across. You then use the parallel_model_trace API to compile the text encoder and transformer component models for PixArt:
compiled_text_encoder = neuronx_distributed.trace.parallel_model_trace(
    get_text_encoder_f,
    sample_inputs,
    compiler_workdir=f"{compiler_workdir}/text_encoder",
    compiler_args=compiler_flags,
    tp_degree=tp_degree,
)
Please refer to the compile_text_encoder.py file for more details on tracing the text encoder with tensor parallelism.
Finally, you trace the transformer model with tensor parallelism:
compiled_transformer = neuronx_distributed.trace.parallel_model_trace(
    get_transformer_model_f,
    sample_inputs,
    compiler_workdir=f"{compiler_workdir}/transformer",
    compiler_args=compiler_flags,
    tp_degree=tp_degree,
    inline_weights_to_neff=False,
)
Please refer to the compile_transformer_latency_optimized.py file for more details on tracing the transformer with tensor parallelism.
You will use the compile_latency_optimized.sh script to compile all three models as described in this post, so these functions run automatically when you work through the notebook.
Step 3 – Deploy the model on AWS Trainium to generate images
This section walks through the steps to run inference with PixArt-Sigma on AWS Trainium.
Create a diffusers pipeline object
The Hugging Face diffusers library is a library for pre-trained diffusion models, and includes model-specific pipelines that bundle the components (independently trained models, schedulers, and processors) needed to run a diffusion model. The PixArtSigmaPipeline is specific to the PixArt-Sigma model, and is instantiated as follows:
pipe: PixArtSigmaPipeline = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    torch_dtype=torch.bfloat16,
    local_files_only=True,
    cache_dir="pixart_sigma_hf_cache_dir_1024")
Please refer to the hf_pretrained_pixart_sigma_1k_latency_optimized.ipynb notebook for details on pipeline execution.
Load the compiled component models into the generation pipeline
Once each component model has been compiled, load the compiled models into the overall generation pipeline for image generation. The VAE model is loaded with data parallelism, which allows us to parallelize image generation across the batch size or multiple images per prompt. For more details, refer to the hf_pretrained_pixart_sigma_1k_latency_optimized.ipynb notebook.
vae_decoder_wrapper.model = torch_neuronx.DataParallel(
    torch.jit.load(decoder_model_path), (0, 1, 2, 3), False
)

text_encoder_wrapper.t = neuronx_distributed.trace.parallel_model_load(
    text_encoder_model_path
)
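The compiled transformer is loaded in the same way as the text encoder, with parallel_model_load; the wrapper attribute name in this sketch is an assumption for illustration rather than the repository's exact code:

# Assumed attribute name on the transformer wrapper; see the notebook for
# the exact loading code.
transformer_wrapper.transformer = neuronx_distributed.trace.parallel_model_load(
    transformer_model_path
)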
Finally, the loaded models are added to the generation pipeline:
pipe.text_encoder = text_encoder_wrapper
pipe.transformer = transformer_wrapper
pipe.vae.decoder = vae_decoder_wrapper
pipe.vae.post_quant_conv = vae_post_quant_conv_wrapper
Compose prompts
Now that the model is ready, you can write a prompt to convey what kind of image you want generated. When creating a prompt, you should always be as specific as possible. You can use a positive prompt to convey what you want in your new image, including a subject, action, style, and location, and you can use a negative prompt to indicate features that should be removed.
For example, you can use the following positive and negative prompts to generate a photo of an astronaut riding a horse on Mars without mountains:
# Subject: astronaut
# Action: riding a horse
# Location: Mars
# Style: photo
prompt = "a photo of an astronaut riding a horse on mars"
negative_prompt = "mountains"
Feel free to edit the prompt in your notebook using prompt engineering to generate an image of your choosing.
Generate an image
To generate an image, you pass the prompt to the PixArt model pipeline, and then save the generated image for later reference:
# pipe: variable holding the Pixart generation pipeline with each of
# the compiled component models
images = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_images_per_prompt=1,
    height=1024,  # number of pixels
    width=1024,  # number of pixels
    num_inference_steps=25  # number of passes through the denoising model
).images

for idx, img in enumerate(images):
    img.save(f"image_{idx}.png")
Clean up
To avoid incurring additional costs, stop your EC2 instance using either the AWS Management Console or the AWS Command Line Interface (AWS CLI).
Conclusion
In this post, we walked through how to deploy PixArt-Sigma, a state-of-the-art diffusion transformer, on Trainium instances. This post is the first in a series focused on running diffusion transformers for different generation tasks on Neuron. To learn more about running diffusion transformer models with Neuron, refer to Diffusion Transformers.
About the Authors
Achintya Pinninti is a Solutions Architect at Amazon Web Services, supporting public sector customers and enabling them to achieve their objectives using the cloud. Achintya specializes in building machine learning and data solutions to solve complex problems.
Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She uses her experience with AI/ML to guide companies in selecting and implementing the right technologies for their business objectives, setting them up for scalable growth and innovation in the competitive startup world.
Sadaf Rasool is a Solutions Architect at Annapurna Labs in AWS. Sadaf collaborates with customers to design machine learning solutions that address their critical business challenges, and helps them train and deploy machine learning models using AWS Trainium or AWS Inferentia chips to accelerate their innovation journey.
John Gray is a Solutions Architect at Annapurna Labs, AWS, based out of Seattle. In this role, John works with customers on their AI and machine learning use cases, architects solutions to cost-effectively solve their business problems, and helps them build scalable prototypes using AWS AI chips.