Disable attention mask during new token generation #1688

Andrei-Aksionov · 2024-08-21T11:06:02Z

Hey there 👋

This is a draft (a very ugly one) to check my assumption that you don't need an attention mask during generation for a batch size of 1 and kv-cache enabled. Attention mask is needed during a prefill stage, but not during a token generation.

The thing is, with SDPA if a mask is provided the flash attention is disabled, which makes it slower.

Important

Work is still in progress.
The output is ~~not~~ almost identical, but speedup is not significant.

With prompt:

litgpt generate microsoft/phi-2 --prompt "Tell me a very long story about llamas" --max_new_tokens=500

Current main:

Output: In the land of llamas, a wise old llama named Llamasaurus lived in a cozy little burrow. He was known for his vast knowledge and his ability to solve any problem. One day, a group of adventurous children stumbled upon his burrow and asked if he could teach them something amazing. Llamasaurus thought for a moment and decided to teach them about the power of llama magic. He showed them a small bag of llama dust and instructed them to sprinkle it in the air. Suddenly, the room filled with a gentle breeze, and the furniture started moving on its own. The children were amazed and couldn't believe their eyes. Llamasaurus explained that llama magic was a special gift that llamas possessed. He encouraged the children to use their newfound powers responsibly and to always remember the importance of kindness. From that day on, the children became known as the Llama Magicians, and they used their powers to bring joy and happiness to everyone they met. Llamasaurus was proud of their accomplishments and knew that he had passed down an important lesson to the next generation.

Time for inference 1: 4.86 sec total, 46.93 tokens/sec
Memory used: 5.76 GB

This PR:

Output: In the land of llamas, a wise old llama named Llamasaurus lived in a cozy little burrow. He was known for his vast knowledge and his ability to predict the weather. One day, as he was taking a leisurely stroll through the meadow, he noticed a strange phenomenon in the sky. The clouds were swirling in a pattern that he had never seen before. Intrigued, Llamasaurus decided to investigate. He climbed to the top of the highest hill and watched as the clouds transformed into different shapes and colors. It was like nothing he had ever seen before. As he watched, the clouds began to form the shape of a llama. Llamasaurus couldn't help but laugh at the sight. It was as if the sky was playing a game with him. He knew he had to share this incredible experience with others. So, he gathered all the llamas in the land and told them about the magical llama in the sky. They were amazed and couldn't wait to witness it for themselves. The next day, the llamas gathered at the hilltop and waited patiently. And just as Llamasaurus had predicted, the clouds transformed into a magnificent llama. It was a sight to behold. The llamas celebrated and danced in joy, grateful to Llamasaurus for sharing such a wonderful experience. From that day on, whenever the llamas looked up at the sky, they would remember the magical llama and the wise old llama who had brought them together.

Time for inference 1: 6.49 sec total, 48.07 tokens/sec
Memory used: 5.76 GB

Andrei-Aksionov added 3 commits August 21, 2024 11:00

Disable attention mask during new token generation

69c5e75

Merge branch 'main' into disable_attn_mask

77f00b1

Update with working code

b9d634f

Andrei-Aksionov closed this Aug 25, 2024

rasbt deleted the disable_attn_mask branch September 24, 2024 17:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Disable attention mask during new token generation #1688

Disable attention mask during new token generation #1688

Disable attention mask during new token generation #1688

Disable attention mask during new token generation #1688

Conversation