Linked resources collected from Week 1 of the ByteByteAI AI Engineering Course.
- AI Index: State of AI in 13 Charts: https://hai.stanford.edu/news/ai-index-state-ai-13-charts
- Does Anthropic crawl data from the web, and how can site owners block the crawler?: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
- GPT-2 paper: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- Common Crawl: https://commoncrawl.org/
- TensorFlow C4: https://www.tensorflow.org/datasets/catalog/c4
- Hugging Face C4: https://huggingface.co/datasets/allenai/c4
- Dolma (Hugging Face paper page): https://huggingface.co/papers/2402.00159
- Dolma paper: https://arxiv.org/pdf/2402.00159
- RefinedWeb: https://arxiv.org/abs/2306.01116
- FineWeb: https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
- URL filtering blocklist: https://dsi.ut-capitole.fr/blacklists/
- Byte Pair Encoding (BPE)
- BPE visualization: https://process-mining.tistory.com/189
- Hugging Face BPE: https://huggingface.co/learn/llm-course/en/chapter6/5
- Hugging Face Llama 3: https://huggingface.co/docs/transformers/en/model_doc/llama3
- Tokenization
- Tiktokenizer: http://tiktokenizer.vercel.app/
- Tiktoken library: https://github.com/openai/tiktoken
- Transformers
- Attention Is All You Need: https://arxiv.org/abs/1706.03762
- The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
- Llama 3 paper: https://arxiv.org/abs/2407.21783
- Generation strategies: https://huggingface.co/docs/transformers/en/generation_strategies
- How to generate text: using different decoding methods for language generation with Transformers: https://huggingface.co/blog/how-to-generate
- Instruction tuning datasets: https://huggingface.co/collections/mapama247/instruction-tuning-datasets-65ddec58a16a00a4c84e5cf1
- Training language models to follow instructions with human feedback: https://arxiv.org/abs/2203.02155
- Alpaca: https://huggingface.co/datasets/tatsu-lab/alpaca