zigzagcai (Zheng Cai) · GitHub

🎯 Evolving
  • Shanghai, China


Hi there 😄

Short Bio

I am Zheng Cai, nicknamed zigzagcai, an AI Infra Engineer and lifelong learner.

I have a general interest in (M)LLM pre-/post-training and like to share my thoughts in blog posts on Zhihu, such as Reflections prompted by non-converging InternLM-7B training on the A800 platform and Training Mamba-1 with variable-length sequences.

🥑 For now, I have a personal interest in Agentic RL and Inference-Time Scaling, and believe they will bring a new paradigm shift.

🍓 For AI, I believe that more is different and that intelligence emerges from complexity, and I like the ideas behind The Bitter Lesson.

🍒 For Infra, I love building practical distributed systems that orchestrate computation, communication, and caching to scale up and scale out better, and I believe in the ideas behind The Hardware Lottery.

So, what I try to do is build a bridge between various accelerators and large models, with the hope of achieving efficient system-model co-design in the new AI paradigm (Self-Evolving Agentic AI Systems).

My Thinking

I love the general idea of open source (code, knowledge, and more); I learn from the open-source community and try my best to contribute back.

Selected thoughts and contributions I have shared or developed:

  1. Optimizing CPU memory when using the PyTorch DataLoader over very large-scale datasets (see the sketch after this list): pytorch/pytorch#13246 (comment)
  2. Analyzing the numerical stability of Ring vs. Tree All-Reduce: NVIDIA/nccl#1055
  3. Implementing variable-length training with Mamba State Space Models: state-spaces/mamba#244
  4. Avoiding deadlock when training with ColossalAI on very large-scale GPU clusters: hpcaitech/ColossalAI#5625
  5. Making DeepSeek V3 671B trainable with FSDP+EP by patching two lines of PyTorch FSDP code: https://github.com/zigzagcai/DeepSeekV3
  6. Supporting the nogil feature for NumPy 1.18.5 in the experimental CPython ecosystem: https://github.com/colesbury/numpy/commit/0d6ef2770268711ee6417792ba0da35fcb264bf5
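
As a quick illustration of the first item above: a common workaround for DataLoader memory growth is to keep per-sample metadata in NumPy arrays instead of Python object lists, so that forked workers do not trigger copy-on-write through Python reference counting. The sketch below is a hypothetical minimal example in that spirit (class and variable names are mine, not the exact code from the linked issue).

```python
import numpy as np
from torch.utils.data import Dataset, DataLoader


class PackedPathsDataset(Dataset):
    """Hypothetical dataset that stores millions of file paths.

    Keeping the paths in a plain Python list makes every element a refcounted
    Python object; forked DataLoader workers touch those refcounts and trigger
    copy-on-write, so resident memory grows with the number of workers.
    Packing everything into two NumPy arrays keeps the data in untouched blocks.
    """

    def __init__(self, paths):
        encoded = [p.encode("utf-8") for p in paths]
        # Byte offsets into the packed blob (length = len(paths) + 1).
        self.offsets = np.cumsum([0] + [len(e) for e in encoded])
        # All paths concatenated into one contiguous uint8 array.
        self.blob = np.frombuffer(b"".join(encoded), dtype=np.uint8)

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, idx):
        start, end = self.offsets[idx], self.offsets[idx + 1]
        return bytes(self.blob[start:end]).decode("utf-8")


# Usage: the two arrays are shared read-only across forked workers.
# loader = DataLoader(PackedPathsDataset(paths), num_workers=8)
```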

Popular repositories

  1. varlen_mamba

    Forked from state-spaces/mamba

    Mamba SSM architecture that supports training on variable-length sequences

    Python · 12 stars · 1 fork

  2. DeepSeekV3

    Simple and efficient implementation of the 671B DeepSeek V3 that is trainable with FSDP+EP, requiring a minimum of 256x A100/H100, targeted at the HuggingFace ecosystem

    Python · 7 stars · 1 fork

  3. NVSHMEM

    Forked from NVIDIA/nvshmem

    NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmer…

    C++ · 1 star

  4. devtools

    My personal dev tools, to make life easier.

    Shell

  5. TransformerEngine

    Forked from NVIDIA/TransformerEngine

    A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilizatio…

    Python

  6. OpenRLHF

    Forked from OpenRLHF/OpenRLHF

    An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & RingAttention & RFT)

    Python
