AlphaCode 12-22 (Junyoung Chung)
| Started on | 2024-11-17 17:02 |
|---|---|
| State | Finished |
| Completed on | 2024-11-18 10:54 |
| Time taken | 17 hours 51 mins |
| Marks | 12.00/27.00 |
| Grade | 13.33 out of 30.00 (44%) |
Question 1 (Correct, 1.00 out of 1.00)
Question text: In generative modeling, VAEs and GANs suffer from specific limitations, such as posterior collapse in VAEs and mode collapse in GANs. Considering these challenges, diffusion models are introduced as an alternative. Which of the following statements accurately describes a limitation unique to diffusion models, and how does it compare to the VAE and GAN collapse phenomena?
A. Diffusion models can suffer from over-smoothing due to poor noise scheduling, leading to a loss of sample diversity and detail. This issue resembles posterior collapse in VAEs, resulting from an inability to recover data complexity effectively after noise addition.
B. Diffusion models primarily suffer from computational inefficiency due to rapid sampling requirements, which limits their ability to model complex distributions, making them analogous to GAN mode collapse.
C. Diffusion models are highly susceptible to posterior collapse because they rely on probabilistic encoding and decoding, leading to a compressed latent space similar to VAEs.
D. Diffusion models frequently experience mode collapse, like GANs, due to insufficient steps in the reverse process, causing the model to generate identical samples.
Feedback: Your answer is correct. Diffusion models risk over-smoothing and loss of detail due to poor noise scheduling, which can result in a failure to recover data complexity, loosely comparable to posterior collapse in VAEs. This occurs not due to latent space compression but due to challenges in the noise reversal process.
The correct answer is: A.
Question 2 (Partially correct, 0.67 out of 2.00)
Question text: Which of the following pairs is TRUE for the transformer? Multiple answers are allowed; wrong answers are penalized.
[x] A. Masking - to mask out the values where we don't want them to be attended, and to impose causality for decoding.
[x] B. Layer normalization - serves the same purpose as batch normalization, except it doesn't require batch samples to compute the statistics.
[ ] C. Embedding module - to non-linearly transform the sequential data input.
[x] D. Multi-head self-attention - to compute the correlation of a pair of input embeddings multiple times.
[x] E. FC (feedforward) layers in the encoder/decoder blocks - de-embedding (inverse projection).
Feedback: Embedding module - to non-linearly transform the sequential data input.
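The masking in option A of Question 2 can be sketched as a small toy example (the shapes and scores here are illustrative assumptions, not from the lecture material): future positions are set to -inf before the row-wise softmax, so each token attends only to itself and earlier tokens.

```python
import numpy as np

def causal_masked_softmax(scores: np.ndarray) -> np.ndarray:
    """Apply a causal mask to raw attention scores, then softmax each row.

    scores: (seq_len, seq_len) matrix of query-key dot products.
    Positions j > i are set to -inf so they receive exactly zero weight.
    """
    seq_len = scores.shape[0]
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above diagonal
    masked = np.where(mask, -np.inf, scores)
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy example: 3 tokens with uniform (zero) scores.
weights = causal_masked_softmax(np.zeros((3, 3)))
# Row i attends uniformly over tokens 0..i; all future weights are 0.
```

With uniform scores, row 0 is [1, 0, 0], row 1 is [0.5, 0.5, 0], and row 2 is [1/3, 1/3, 1/3]: exactly the causality constraint the decoder needs.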
Question 3 (Correct)
Question text: You are working with the MADE model to generate synthetic samples from a dataset of user purchase sequences. The dataset includes sequences of items purchased by users, where each item is represented by a categorical value, and the order of items matters. To model these sequences, you use MADE, an autoregressive neural network that outputs a conditional probability distribution for each item in the sequence given the preceding items. During the training process, you notice that some item sequences contain missing data, where certain items are unknown. Which approach best leverages the MADE model's properties to handle these incomplete sequences?
A. Mask the missing items during training so that MADE only learns from observed items and generates conditional probabilities based on available context.
B. Use a different model, such as a GAN, as MADE cannot handle missing values without an i.i.d. assumption in the data.
C. Impute the missing items by filling them with the most common item in the dataset, allowing MADE to train on fully filled sequences.
D. Remove all incomplete sequences from the dataset, as MADE requires complete input data to generate accurate conditional probabilities.
Feedback: Your answer is correct. MADE is designed to handle dependencies in autoregressive models through masking. This property can be extended to handle missing data by ignoring (masking) missing items during training. This way, MADE learns conditional probabilities based only on the observed context, a strength of the autoregressive approach and of the masking technique used in MADE.
The correct answer is: A.
Question 4 (Correct, 1.00 out of 1.00)
Question text: You are tasked with generating high-quality samples from a diffusion model using classifier-free guidance with the following settings:
• Guidance strength w is set to 1.5.
• The SNR sequence λ_t is chosen to be logarithmically spaced to prioritize gradual noise reduction over earlier steps.
You need to avoid mode collapse in the generated samples and preserve diversity in the output. Based on Algorithm 2, which adjustments would most likely help you achieve high-quality and diverse samples?
A. Decrease the guidance strength w to 0.5 and sample z_{t+1} deterministically (setting σ_{λt+1|λt} = 0).
B. Increase the guidance strength w to 2.0 and set the variance term σ²_{λt+1|λt} to a small constant value for all time steps.
C. Increase w to 3.0 and use a fixed mean adjustment μ_{λt+1|λt} without variance scaling, allowing the algorithm to prioritize conditional features at each step.
D. Keep w = 1.5, but increase the variance term σ²_{λt+1|λt} in step 5 dynamically based on time t, allowing more noise in the early stages and less towards the end.
Feedback: Your answer is correct. It maintains guidance strength but varies the noise dynamically. Increasing variance in earlier steps encourages exploration and diversity, while reducing it later improves convergence to high-quality samples.
The correct answer is: D.
Question 5 (Correct, 1.00 out of 1.00)
Question text: Which is FALSE for AR-based image generation models?
A. Generated image quality and resolution are poorer than those of state-of-the-art GAN-based, diffusion-based, and VAE-based models.
B. It lacks flexibility and versatility, such as image manipulation and editing, compared to GAN-based, flow-based, and VAE-based models.
C. Image generation is generally slower than that of GAN-based and VAE-based models.
D. The i.i.d. assumption for data is imposed on the AR model.
E. None of the above.
Feedback: AR models do not typically assume i.i.d. data, because each pixel depends on the previously generated pixels. This dependency structure is inherent in AR models, unlike models that assume independence.
The correct answer is: D.
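As background for the guidance strength w in Question 4: classifier-free guidance forms the sampling noise estimate by extrapolating from the unconditional prediction toward the conditional one. A minimal sketch, assuming the common formulation ϵ̃ = (1 + w)·ϵ_cond − w·ϵ_uncond (the example vectors are illustrative, not taken from the quiz):

```python
import numpy as np

def cfg_noise(eps_cond: np.ndarray, eps_uncond: np.ndarray, w: float) -> np.ndarray:
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the conditional one. w = 0 recovers the purely
    conditional model; larger w strengthens guidance but can reduce
    sample diversity."""
    return (1.0 + w) * eps_cond - w * eps_uncond

# Illustrative (assumed) predictions at one timestep.
eps_c = np.array([0.10, -0.20])
eps_u = np.array([0.05, -0.10])
eps_guided = cfg_noise(eps_c, eps_u, w=1.5)
```

This makes the trade-off in the question concrete: w scales how far the sample is pushed toward class-consistent features, which is why pairing a fixed w with dynamically scheduled variance (option D) is what preserves diversity.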
Question 6 (Partially correct, 1.33 out of 2.00)
Question text:
[ ] A. By using deterministic sampling, DDIM can guarantee that each generated sample will have maximum diversity, even with a reduced number of timesteps, which is crucial for producing a wide range of images.
[ ] B. DDIM requires significantly more training time than DDPM to learn the deterministic sampling process, as it must be optimized for each possible number of sampling steps, which may hinder its efficiency.
[x] C. Reducing the number of sampling steps in DDIM can speed up the image generation process without substantially sacrificing sample quality, making it suitable for applications requiring quick generation.
[x] D. Since DDIM can operate with fewer steps without a major quality drop, it may be more efficient for applications with limited computational resources than traditional diffusion models like DDPM.
[x] E. With deterministic sampling, DDIM allows you to control the number of sampling steps easily, balancing sample quality and generation speed, which can be adjusted depending on computational resources.
[x] F. Unlike DDPM, DDIM's deterministic nature means that generated samples are deterministic and will be identical for the same initial noise and fixed number of steps, offering consistency for applications that require reproducible outputs.
Feedback: Your answer is partially correct. You have selected too many options.
The correct answers are: C, E, and F.
Question 7 (Correct, 1.00 out of 1.00)
Question text:
In VAEs and GANs, image editing is commonly performed by manipulating the latent vector representing an image's "compressed" version. This process can be seen as the "inverse" of image generation, where changes in the latent vector lead to targeted modifications in the final image. Diffusion models, however, generate images through a forward and reverse noise process rather than a single latent vector representation. How can image editing be performed with diffusion models, and what technique is typically applied to achieve targeted changes?
A. Latent guidance is applied during the reverse diffusion process, where additional conditions are set on the noise estimate ϵ_θ to guide the model toward desired edits while denoising.
B. Diffusion models use latent interpolation, where a trained latent vector is manipulated directly, similar to GANs and VAEs, to produce an edited image in the reverse diffusion process.
C. Diffusion models are limited in their ability to perform image editing since they do not have a traditional latent space; thus, editing requires retraining the entire model with modified conditions.
D. Image editing in diffusion models relies on conditioning during the reverse diffusion process, where the noisy image x_t is iteratively modified with guidance from an external model or condition, allowing targeted changes as the noise is removed.
Feedback: Your answer is correct. In diffusion models, image editing is often performed through conditioning on the desired features or attributes. For instance, guided diffusion methods or classifier-free guidance can be used, where the reverse diffusion process incorporates extra conditions (such as text or attribute embeddings) to steer the image toward the desired edits.
The correct answer is: D.
Question 8 (Incorrect, 0.00 out of 1.00)
Question text:
A company is developing an AI-based language translation tool to translate English sentences into Korean. For this task, they use a Transformer model with encoder and decoder components. During testing, they noticed that while the encoder correctly captures the context of the input English sentences, the decoder occasionally produces incomplete or grammatically incorrect Korean translations. They suspect that the issue may be due to how the Transformer decoder handles the generation of each word in the translated sentence. Which of the following adjustments to the Transformer decoder will most likely improve the translation quality?
A. Apply masked self-attention in the decoder to ensure each token only attends to itself and previous tokens in the output sequence.
B. Apply unmasked self-attention in the decoder, allowing each token to attend to all tokens in the sequence.
C. Increase the encoder depth by adding more encoder layers to capture input context better.
D. Use greedy decoding to quickly select the highest-probability token at each decoding step.
Feedback: Your answer is incorrect.
The correct answer is: A.
Question 9 (Correct, 1.00 out of 1.00)
Question text: For a DDPM model that undergoes T steps of forward diffusion, the total loss function contains a sum of ___ terms.
A. T+1
B. T+2
C. T
D. T-2
E. T-1
Feedback: There are T+1 terms (L_0 + L_1 + ... + L_T) in total.
The correct answer is: T+1.
Question 10 (Not answered, marked out of 1.00)
Question text:
You are given a multi-head self-attention module with the following parameters:
• Input sequence length = 5 tokens
• Embedding dimension (model dimension) = 64
• Number of attention heads = 4
Each attention head will have its own set of query, key, and value projection matrices, and the output of each head will be concatenated before passing through a final linear layer. Assume that:
Embedding dimension (model dimension) = 64
Number of attention heads = 4
Head dimension: since the embedding dimension is split equally among the 4 heads, each head has a dimension of 64 / 4 = 16.
Each attention head requires its own set of query, key, and value projection matrices. The size of each matrix for one head is calculated as follows:
Each head has:
• Query matrix: 16×64
• Key matrix: 16×64
• Value matrix: 16×64
Since each matrix has dimensions 16×64, the number of parameters per matrix is 16×64 = 1024. For each head, the total parameters for the query, key, and value matrices are 3×1024 = 3072. Since there are 4 heads, the total parameters for all heads combined are 3072×4 = 12288.
After concatenating the outputs from each of the 4 heads (each of size 16, giving 16×4 = 64 dimensions), there is an output projection layer that maps this concatenated output back to the model dimension (64). The output projection matrix size is 64×64 = 4096.
Adding the parameters from both the attention heads and the output projection layer gives 12288 + 4096 = 16384.
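The counting above can be verified with a short script (a sketch that, like the worked solution, counts weight matrices only and assumes no bias terms):

```python
# Multi-head attention parameter count (weights only, biases assumed absent).
embed_dim = 64
num_heads = 4
head_dim = embed_dim // num_heads      # 64 / 4 = 16

per_head = 3 * head_dim * embed_dim    # Q, K, V: 3 matrices of 16x64 each
all_heads = num_heads * per_head       # 4 heads combined
out_proj = embed_dim * embed_dim       # 64x64 output projection
total = all_heads + out_proj
print(total)
```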
The correct answer is: 16384.
Question 11 (Partially correct, 0.67 out of 2.00)
Question text:
[ ] A. Stepwise Conditioning Adjustment: Adjust the conditioning input from the style image at each step of the reverse denoising process to gradually incorporate stylistic features, allowing a balance between the content and style elements in the final output.
[ ] B. Pure Gaussian Initialization: Begin the diffusion process with pure Gaussian noise and rely solely on conditioning based on the content image during the reverse process to integrate both style and content into the final image.
[x] C. Content Similarity Constraint: Apply a content similarity constraint using a pretrained network (e.g., VGG) to ensure that the denoised image closely aligns with the structural details of the content image, thereby preserving its integrity. (A content similarity constraint in diffusion-based style transfer ensures that the generated image retains the key structural features of the original content image, even as it adopts the style of another image. During the reverse denoising process, this constraint is applied by comparing the features of the generated image and the content image using a pretrained network such as VGG at each step. By minimizing the difference in these features, the model preserves the core structure of the content image while blending in the style image's textures and colors, producing a result that is visually consistent with both sources.)
[x] D. Conditioning from Content Image: At each reverse diffusion step, utilize conditioning information derived from the content image to guide the process, ensuring that the structural details of the content image are retained in the stylized output.
[x] E. Style-Specific Feature Extraction: Extract key visual characteristics from the style image and incorporate them as conditioning information in the reverse diffusion steps to ensure that the final image reflects the intended style.
[ ] F. Extended Denoising Steps: Increase the number of reverse diffusion steps beyond typical settings to strengthen the influence of the content image, thereby achieving higher fidelity to the original content.
Feedback: Your answer is partially correct. You have correctly selected 2.
The correct answers are: A, C, and D.
Question 12 (Partially correct, 1.33 out of 2.00)
Question text:
Multiple answers are allowed; wrong answers are penalized.
[x] A. Reinforcement Learning with Human Feedback (RLHF) – to adjust the model's response quality and tone based on human feedback and preferences.
[ ] B. Data augmentation – to improve accuracy by artificially generating additional examples of user queries.
[x] C. Low-Rank Adaptation (LoRA) fine-tuning – to enhance the model's alignment with company-specific language style and refine responses based on the desired tone.
[x] D. Prompt engineering – to adjust the output quality through manual prompt modifications without re-training the model.
[ ] E. Self-supervised learning – to fine-tune the model's performance without human feedback or an updated data corpus.
[x] F. Retrieval-Augmented Generation (RAG) – to improve factual accuracy by pulling updated information from a knowledge base or document store.
Feedback: Your answer is partially correct. You have selected too many options.
RLHF: useful for refining the tone and quality of responses based on human feedback, ensuring alignment with customer preferences.
RAG: helps the model access the latest product information, addressing the need for updated and accurate responses.
LoRA fine-tuning: effective in aligning the model with specific language styles and improving the response tone and quality based on feedback.
The correct answers are: A, C, and F.
Question 13 (Correct, 1.00 out of 1.00)
Question text: Suppose the maximum input token length for GPT-3.5 is 30.
How many padding tokens will be introduced if the input text is "Can you suggest fun activities for a family of four to do indoors on a rainy day?"
A. 11
B. 0
C. 17
D. 7
E. 5
Feedback: The input token size to GPT-3.5 (a decoder of the transformer) is fixed at 30. According to https://platform.openai.com/tokenizer, the token count for the sentence is 19. Hence, 30 - 19 = 11 padding tokens are introduced.
The correct answer is: 11.
Question 14 (Not answered, marked out of 8.00)
Question text: In a classifier-guided diffusion model, at timestep t, you have:
• A noisy image x_t
• Classifier gradient ∇p(y|x)
• Guidance scale c = 2.5
• Base diffusion model noise prediction ϵ_θ(x_t) = [0.15, -0.25]
• The classifier outputs probabilities for three classes.
• Assume p(y) = [0.6, 0.3, 0.1] (probabilities for each class) to weight the importance of each class's gradient.
Suppose a CNN-based classifier is adopted. The classifier comprises a convolutional backbone followed by a linear classifier y = softmax(Wz), where W = [0.2, 0.5; -0.3, 0.8; 0.1, -0.4] is the class weight matrix, z is the embedding vector produced by the backbone, and y is the softmax output.
Hint: W is a 3×2 matrix, indicating three classes (outputs) associated with the two components of z. The chain rule of calculus is required to solve it. Each row in the gradient matrix represents the partial derivatives of one class probability with respect to each component of x; hence, there are three rows (one per class) and two columns (one per component of x).
(a) Assuming x = z, compute ∇p(y|x).
(b) Compute ϵ̃_θ(x_t).
(c) What could be a potential downside of using a very high guidance scale c in a classifier-guided diffusion model?
Feedback
Only a partial mark is awarded if no working step is shown.
The embedding z (= x) is intentionally unspecified, both to assess understanding and to detect ChatGPT-generated answers. You have to state the z you assumed and show your working.
Solution skeleton: let h = Wz and p(y|z) = softmax(h).
Suppose z = [0.1, 0.2]. Compute h and p(y|z), then apply the chain rule to compute the Jacobian ∇_z p(y|z) = (diag(p) - p pᵀ) W.
Below is the Python code for the calculation:

import numpy as np
from scipy.special import softmax

W = np.array([[0.2, 0.5],
              [-0.3, 0.8],
              [0.1, -0.4]])   # class weight matrix (3 classes x 2 components)
z = np.array([0.1, 0.2])      # assumed embedding vector (z = x)

h = W @ z                     # logits
p_y_given_x = softmax(h)      # p(y|z)

# Chain rule: dp/dz = (d softmax/dh) @ (dh/dz)
#   d softmax/dh = diag(p) - p p^T   (3x3 softmax Jacobian)
#   dh/dz        = W                 (3x2)
jac_softmax = np.diag(p_y_given_x) - np.outer(p_y_given_x, p_y_given_x)
gradient_p_y_given_x = jac_softmax @ W   # 3x2: row i = dp_i/dz

epsilon_theta_xt = np.array([0.15, -0.25])
c = 2.5
p_y = np.array([0.6, 0.3, 0.1])

# Guided noise prediction, with each class gradient weighted by p(y)
tilde_epsilon_theta_xt = epsilon_theta_xt + c * (gradient_p_y_given_x.T @ p_y)
print("Gradient ∇p(y|x):\n", gradient_p_y_given_x)
print("Guided noise ϵ̃_θ(x_t):", tilde_epsilon_theta_xt)

(a) [0.0730, 0.0548; -0.1043, 0.1622; 0.0312, -0.2170] (based on the assumption z = [0.1, 0.2])
(b) [0.1891, -0.1004]
(c). It may overly constrain the model to generate images that strictly adhere to the classifier’s preferences, potentially reducing diversity and realism.
Question 15 (Partially correct, 1.00 out of 2.00)
Question text: Which of the following is/are FALSE for DDPM? Multiple answers are allowed; wrong answers are penalized.
[ ] A. DDPM is optimized by minimizing the exact negative log-likelihood.
[ ] B. DDPM can be reframed as a score-based model, and the noise prediction network (U-Net) is implicitly used to model the score function.
[ ] C. In DDPM, noise is added to the data gradually through a series of steps, and then a denoising process is used to remove the added noise gradually.
[x] D. DDPM is non-Markovian.
[ ] E. The U-Net in DDPM is used for noise estimation at different time steps.
[ ] F. DDPM is inspired by non-equilibrium thermodynamics.
Feedback: "DDPM is optimized by minimizing the exact negative log-likelihood."
False: the likelihood of DDPM is intractable, so it is approximated with variational inference techniques (the ELBO). "DDPM is non-Markovian."
False: it is Markovian.
The correct answers are: A and D.
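The gradual noising in option C, and the closed-form shortcut it admits, can be sketched as follows (an illustrative toy with an assumed linear β schedule, not taken from the quiz): x_t = √ᾱ_t·x_0 + √(1-ᾱ_t)·ε, with ᾱ_t the cumulative product of α_s = 1 - β_s.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative product: alpha-bar_t

def q_sample(x0: np.ndarray, t: int) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0) in one shot via the closed form,
    instead of iterating the t Markov noising steps."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

x0 = np.ones(4)
x_early = q_sample(x0, t=10)      # still close to x0 (alpha-bar near 1)
x_late = q_sample(x0, t=T - 1)    # nearly pure noise (alpha-bar near 0)
```

Note that each forward step depends only on the previous state, which is precisely the Markov property; that is why option D ("DDPM is non-Markovian") is false, and the one-shot formula above is simply the composition of those Gaussian steps.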