Apache Kafka / distributed system architecture: Udemy, YouTube, or other good learning courses?

Hi, my background is backend engineering, but my only queue-service experience is with AWS SQS and SNS.

My next company uses Apache Kafka in a distributed system architecture, and I'd like to prepare by learning both.

I've been looking at Udemy courses, YouTube videos, and articles, but there are a lot of them.

If anyone knows a step-by-step, level-up course, articles, or anything like that, please recommend them. Thank you so much.
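For orientation while choosing a course, here is a minimal sketch of what basic Kafka usage looks like in Python with the kafka-python client, assuming a broker on localhost:9092 and a topic named "events"; the broker address, topic, keys, and group name are all placeholders:

    # Produce: roughly analogous to publishing to an SNS topic or sending to an SQS queue.
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("events", key=b"user-42", value=b'{"action": "signup"}')
    producer.flush()  # block until the message is actually delivered to the broker

    # Consume: unlike SQS, messages stay in the log; each consumer group tracks its own offset.
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        group_id="backend-service",      # consumers in the same group share partitions
        auto_offset_reset="earliest",    # start from the beginning if no committed offset
    )
    for message in consumer:
        print(message.partition, message.offset, message.key, message.value)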

πŸ‘︎ 10
πŸ’¬︎
πŸ‘€︎ u/rurikana
πŸ“…︎ Dec 18 2021
🚨︎ report
[R] Kwai, Kuaishou & ETH ZΓΌrich Propose PERSIA, a Distributed Training System That Supports Deep Learning-Based Recommenders of up to 100 Trillion Parameters

A research team from Kwai Inc., Kuaishou Technology and ETH ZΓΌrich has built PERSIA, an efficient distributed training system that leverages a novel hybrid training algorithm to ensure both training efficiency and accuracy for extremely large deep learning recommender systems of up to 100 trillion parameters.

Here is a quick read: Kwai, Kuaishou & ETH ZΓΌrich Propose PERSIA, a Distributed Training System That Supports Deep Learning-Based Recommenders of up to 100 Trillion Parameters.

The code is available on the project’s GitHub. The paper PERSIA: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters is on arXiv.

πŸ‘︎ 113
πŸ’¬︎
πŸ‘€︎ u/Yuqing7
πŸ“…︎ Nov 26 2021
🚨︎ report
Don't Just Track Your Machine Learning Experiments, Version Them - Distributed Versioning vs. Centralized Tracking for ML Experiments

Machine learning experiments often get split between Git for code and experiment tracking tools for meta-information, because Git can't manage or compare all that experiment meta-information, even though it is still the better tool for code.

The following guide explains how to use DVC for ML experiment versioning, which combines experiment tracking and version control: Don't Just Track Your ML Experiments, Version Them. Instead of managing these separately, you keep everything in one place and get the benefits of both, like:

  • Experiments as code: Track meta-information in the repository and version it like code.
  • Versioned reproducibility: Save and restore experiment state, and track changes to only execute what's new.
  • Distributed experiments: Organize locally and choose what to share, reusing your existing repo setup.

Experiment versioning treats experiments as code. It saves all metrics, hyperparameters, and artifact information in text files that can be versioned by Git, which becomes a store for experiment meta-information. The article above shows how, with the DVC tool, you can push experiments just like Git branches, giving you the flexibility to share only the experiments you choose.
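As a rough illustration of the "meta-information in text files" idea (the file names and values below are hypothetical, not DVC's actual layout, which the article covers):

    # Hyperparameters and metrics live in small text files that Git can diff,
    # branch, and push like any other code. DVC automates this bookkeeping.
    import json
    from pathlib import Path

    params = {"lr": 0.001, "batch_size": 64, "epochs": 10}   # hyperparameters for this run
    metrics = {"accuracy": 0.914, "loss": 0.23}              # illustrative results of the run

    Path("params.json").write_text(json.dumps(params, indent=2))
    Path("metrics.json").write_text(json.dumps(metrics, indent=2))

    # After this, `git diff metrics.json` compares runs, and pushing the branch
    # shares exactly the experiments you choose.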

πŸ‘︎ 8
πŸ’¬︎
πŸ‘€︎ u/cmstrump
πŸ“…︎ Dec 10 2021
🚨︎ report
[P] Cost of distributed deep learning on AWS

Hi everyone,

I am trying to learn some distributed deep learning techniques such as parameter servers and ring all-reduce. I found this paper by Uber where they describe their implementation of ring all-reduce. While the implementation looks fairly straightforward, I am unsure about the costs.

The problem is that I don't have access to multiple GPUs, so I thought I would use AWS EC2. Since I have never used it before (I have used GCP a fair amount, but not for distributed deep learning), I was hoping someone here could shed some light on the costs of using, say, 2 GPUs.

My initial idea was to train a CNN on ImageNet (1k). So I would use S3 and EC2 with two basic GPU instances, as my goal is simply to see how much faster I can reach convergence rather than achieve state-of-the-art times.

Would this be too expensive for a personal project? I'd rather not spend over 200 GBP.
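A back-of-the-envelope sketch of the budget, with purely illustrative on-demand rates and exchange rate (check current EC2 pricing for your region; spot instances are usually much cheaper):

    # Rough cost estimate for two-node ring all-reduce experiments.
    BUDGET_GBP = 200
    GBP_TO_USD = 1.35          # assumed exchange rate
    budget_usd = BUDGET_GBP * GBP_TO_USD

    instances = {
        "g4dn.xlarge (1x T4)": 0.526,   # assumed USD/hour, on-demand
        "p3.2xlarge (1x V100)": 3.06,   # assumed USD/hour, on-demand
    }

    for name, hourly in instances.items():
        pair_hourly = 2 * hourly        # two GPU nodes for ring all-reduce
        hours = budget_usd / pair_hourly
        print(f"{name}: ~{hours:.0f} hours of 2-node training within budget")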

πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/ishotdesheriff
πŸ“…︎ Nov 29 2021
🚨︎ report
Distributed SQL database in Rust, written as a learning project github.com/erikgrinaker/t…
πŸ‘︎ 13
πŸ’¬︎
πŸ‘€︎ u/nybon
πŸ“…︎ Dec 06 2021
🚨︎ report
OneFlow: Redesign the Distributed Deep Learning Framework from Scratch

Deep learning frameworks such as TensorFlow and PyTorch provide a productive interface for expressing and training a DNN model on a single device or using data parallelism. Still, they may not be flexible or efficient enough in training emerging large models on distributed devices, which require more sophisticated parallelism beyond data parallelism. Plugins or wrappers have been developed to strengthen these frameworks for model or pipeline parallelism, but they complicate the usage and implementation of distributed deep learning. Paper: https://arxiv.org/pdf/2110.15032.pdf; Code: https://github.com/Oneflow-Inc/oneflow

Aiming at a simple, neat redesign of distributed deep learning frameworks for various parallelism paradigms, we present OneFlow, a novel distributed training framework based on an SBP (split, broadcast and partial-value) abstraction and the actor model. SBP enables much easier programming of data parallelism and model parallelism than existing frameworks, and the actor model provides a succinct runtime mechanism to manage the complex dependencies imposed by resource constraints, data movement and computation in distributed deep learning.
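To make the SBP terms concrete, here is a toy numpy sketch of what split, broadcast and partial-value mean for a single matmul sharded across two devices (conceptual illustration only, not OneFlow's actual API):

    import numpy as np

    X = np.random.randn(4, 6)   # activations
    W = np.random.randn(6, 8)   # weights

    # broadcast (B): replicate X on both devices; split (S): shard W along its
    # output dim (axis 1). Each device then computes a column slice of Y = X @ W.
    W_dev0, W_dev1 = np.split(W, 2, axis=1)
    Y_cols = np.concatenate([X @ W_dev0, X @ W_dev1], axis=1)   # result is split along axis 1
    assert np.allclose(Y_cols, X @ W)

    # partial-value (P): split X along axis 1 and W along axis 0 instead; each device
    # now holds a partial sum of the full Y, and an all-reduce (here, addition) recovers it.
    X_dev0, X_dev1 = np.split(X, 2, axis=1)
    W_dev0, W_dev1 = np.split(W, 2, axis=0)
    Y_partial = (X_dev0 @ W_dev0) + (X_dev1 @ W_dev1)            # partial-value -> all-reduce
    assert np.allclose(Y_partial, X @ W)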

We demonstrate the general applicability and efficiency of OneFlow for training various large DNN models with case studies and extensive experiments. The results show that OneFlow outperforms many well-known customized libraries built on top of the state-of-the-art frameworks.

πŸ‘︎ 14
πŸ’¬︎
πŸ‘€︎ u/Just0by
πŸ“…︎ Oct 29 2021
🚨︎ report
Kwai, Kuaishou & ETH ZΓΌrich Propose PERSIA, a Distributed Training System That Supports Deep Learning-Based Recommenders of up to 100 Trillion Parameters syncedreview.com/2021/11/…
πŸ‘︎ 4
πŸ’¬︎
πŸ‘€︎ u/Dr_Singularity
πŸ“…︎ Nov 29 2021
🚨︎ report
Distributed Deep Learning Explained
πŸ‘︎ 205
πŸ’¬︎
πŸ‘€︎ u/YLyu
πŸ“…︎ Oct 14 2021
🚨︎ report
What is PyTorch Distributed - Distributed Deep Learning Model Training youtube.com/watch?v=6IbXC…
πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/aniketmaurya
πŸ“…︎ Nov 09 2021
🚨︎ report
The distributed deep learning framework OneFlow v0.5.0RC came out!

We just launched OneFlow v0.5.0RC. With this release, OneFlow has the same API as PyTorch in eager mode, plus a more powerful and friendly API for distributed training. The consistent tensor enables model parallelism and pipeline parallelism without manual programming!

Changelog link: https://github.com/Oneflow-Inc/oneflow/releases/tag/v0.5rc1

Highlights

  • First-class support for eager execution. The deprecated APIs are moved to oneflow.compatible.single_client
  • Drop-in replacement of import torch for existing PyTorch projects. You can test it by interchanging import oneflow as torch and import torch as flow (see the sketch after this list)
  • nn.Module for eager execution
  • nn.Graph for lazy execution
  • DDP for data parallel
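A minimal eager-mode sketch of the drop-in claim above; the module and function names follow the PyTorch-style API described in the release notes, and coverage may vary by release:

    import oneflow as torch   # the interchange trick mentioned in the highlights
    import oneflow.nn as nn

    class TinyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(8, 2)

        def forward(self, x):
            return self.fc(x)

    model = TinyNet()
    x = torch.randn(4, 8)
    print(model(x).shape)   # a (4, 2) output, computed eagerly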
πŸ‘︎ 10
πŸ’¬︎
πŸ‘€︎ u/Just0by
πŸ“…︎ Sep 16 2021
🚨︎ report
ChampSchool has a new distributed learning curriculum πŸ‘€
πŸ‘︎ 29
πŸ’¬︎
πŸ‘€︎ u/Blackbeard-7
πŸ“…︎ Aug 11 2021
🚨︎ report
OneFlow: Redesign the Distributed Deep Learning Framework from Scratch /r/deeplearning/comments/…
πŸ‘︎ 5
πŸ’¬︎
πŸ‘€︎ u/Just0by
πŸ“…︎ Nov 02 2021
🚨︎ report
Inoculated last Monday and already looking amazing! Laid all 6 on their side and moisture is evenly distributed. Learning so much from UBT! reddit.com/gallery/q6xha3
πŸ‘︎ 7
πŸ’¬︎
πŸ‘€︎ u/Hodlertoadler
πŸ“…︎ Oct 12 2021
🚨︎ report
How do I actually do distributed learning on AWS?

Hi! I am working on a custom environment that runs on Node.js and is operated by an agent written in Python. This agent uses a network to play and stores the needed info in a replay buffer. Then I take this buffer, learn from it, and update the network accordingly.

Repeat as needed.

So what I envision is around 10-20 EC2 instances playing this game and adding to their own buffers. Once done with their batch, they dump their buffer into an S3 bucket. This triggers a learning EC2 instance that learns from the buffer and updates the model, which lives in a different bucket.

My question is: am I supposed to just spin up each EC2 instance individually? I'm aware of AMIs, but is there no way to just run the script once and have AWS autoscale by adding EC2 instances? I know there are Auto Scaling groups, but I have no clue how to practically set one up or whether it is even appropriate.

Any guidance on this AWS front is much appreciated!
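On the "spin up each instance individually" question, a single boto3 run_instances call can launch all the rollout workers from one AMI, with a user-data script starting the agent on boot. Everything below (AMI ID, instance type, key name, bucket, script path) is a placeholder:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Runs once on first boot of each worker (placeholder paths and bucket name).
    user_data = """#!/bin/bash
    cd /home/ubuntu/my-env && python agent.py --buffer-bucket my-replay-buffers
    """

    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # your prepared AMI
        InstanceType="c5.xlarge",
        MinCount=15,                        # all 15 workers in one call
        MaxCount=15,
        KeyName="my-keypair",
        UserData=user_data,
    )
    print([i["InstanceId"] for i in response["Instances"]])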

πŸ‘︎ 5
πŸ’¬︎
πŸ‘€︎ u/XcessiveSmash
πŸ“…︎ Sep 06 2021
🚨︎ report
OneFlow: Redesign the Distributed Deep Learning Framework from Scratch (r/MachineLearning) reddit.com/r/MachineLearn…
πŸ‘︎ 2
πŸ’¬︎
πŸ‘€︎ u/Peerism1
πŸ“…︎ Oct 30 2021
🚨︎ report
OneFlow: Redesign the Distributed Deep Learning Framework from Scratch /r/deeplearning/comments/…
πŸ‘︎ 2
πŸ’¬︎
πŸ‘€︎ u/Just0by
πŸ“…︎ Oct 30 2021
🚨︎ report
Multi-Agent RL: Centralized Learning Distributed Execution

Hey there :D

In OpenAI's Hide and Seek work, they used Centralized Learning, Distributed Execution with PPO. What I understand is that they train a shared network across all agents, since the agents are cooperating. My question is: for the case of 2 agents, for example, is a single step in the environment equivalent to two experiences (so two rewards/state values per step are used to update PPO)?

If that is the case, then if my batch size is 400, for example, does this entail 400 environment steps or 400 experiences (in this case, 200 environment steps)?
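For reference, the usual bookkeeping when all agents share one policy (my reading of the common setup, not necessarily OpenAI's exact code) is that each agent contributes its own transition per environment step:

    # With a shared policy, every agent adds its own (obs, action, reward, ...)
    # transition per environment step, so 2 agents yield 2 experiences per step.
    num_agents = 2
    batch_size = 400          # counted in experiences/transitions in most PPO implementations

    env_steps_per_batch = batch_size // num_agents
    print(env_steps_per_batch)  # 200 environment steps fill a 400-experience batch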

πŸ‘︎ 4
πŸ’¬︎
πŸ‘€︎ u/AhmedNizam_
πŸ“…︎ Oct 13 2021
🚨︎ report
[GTS] Distributed Deep Learning Explained
πŸ‘︎ 2
πŸ’¬︎
πŸ‘€︎ u/GTSBot
πŸ“…︎ Oct 15 2021
🚨︎ report
Getting started with Distributed Computing / Distributed Machine Learning using Ray in Notebooks using Free AWS Resources!

I tried getting Ray set up from an online hosted notebook recently and it was quite hard, so I wrote some articles using Deepnote (an online Jupyter notebook service) that show how you can get started (for FREE!) with Ray in the cloud. If you click "Launch in Deepnote", all the code is available and you can run it yourself!

Integrating Deepnote with (free!) Burstable Distributed Computing using Python, AWS EC2 and Ray

Distributed Machine Learning using Ray
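For anyone who wants to see what Ray code looks like before clicking through, here is a minimal local example (the articles cover the multi-node EC2 setup):

    # Parallelise a plain Python function across local cores with Ray.
    # On a cluster, ray.init(address="auto") attaches to it instead.
    import ray

    ray.init()  # starts a local Ray runtime

    @ray.remote
    def square(x):
        return x * x

    futures = [square.remote(i) for i in range(8)]   # tasks run in parallel workers
    print(ray.get(futures))                          # [0, 1, 4, 9, 16, 25, 36, 49]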

πŸ‘︎ 10
πŸ’¬︎
πŸ‘€︎ u/HellBriinger
πŸ“…︎ Aug 28 2021
🚨︎ report
Please recommend good text-based learning resources for distributed systems, Hadoop, and Apache Spark using Scala?

I prefer to learn by reading, so I'm looking for text-based learning resources, something like the Mozilla guide for web developers. Please help.

πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/short_n_ugly
πŸ“…︎ Sep 09 2021
🚨︎ report
graph2vec: Learning Distributed Representations of Graphs (Paper Walkthrough) [D]

graph2vec proposes a technique to embed an entire graph in a high-dimensional vector space. It is inspired by the doc2vec learning approach, applied over graphs and their rooted subgraphs. πŸ”₯

It achieves significant improvements in classification and clustering accuracy over substructure representation learning approaches and is competitive with state-of-the-art graph kernels. πŸ‘ΎπŸ‘Ύ

https://youtu.be/h400_OMWNLo
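A compressed sketch of the doc2vec analogy using networkx and gensim; it uses a much cruder node-neighbourhood label than the paper's Weisfeiler-Lehman rooted subgraphs, so treat it as an illustration of the idea rather than graph2vec itself:

    import networkx as nx
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    def rooted_subgraph_words(g):
        # One "word" per node: its degree plus the sorted degrees of its neighbours
        # (a crude stand-in for the paper's WL-relabelled rooted subgraphs).
        return [
            f"{g.degree(n)}|{'-'.join(str(g.degree(m)) for m in sorted(g.neighbors(n)))}"
            for n in g.nodes
        ]

    graphs = [nx.cycle_graph(6), nx.path_graph(6), nx.star_graph(5), nx.complete_graph(5)]
    docs = [TaggedDocument(words=rooted_subgraph_words(g), tags=[i]) for i, g in enumerate(graphs)]

    # Each graph is a "document"; doc2vec learns one embedding per graph.
    model = Doc2Vec(docs, vector_size=16, min_count=1, epochs=50)
    print(model.dv[0])   # 16-dim embedding of the first graph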

πŸ‘︎ 14
πŸ’¬︎
πŸ‘€︎ u/prakhar21
πŸ“…︎ Aug 20 2021
🚨︎ report
How To Manage Machine Learning Experiments as Code with Git and DVC - Distributed Versioning vs. Centralized Tracking

ML experiment versioning combines experiment tracking and version control, keeping everything in one place so you get the benefits of both: Don't Just Track Your ML Experiments, Version Them

  • Experiments as code: Track meta-information in the repository and version it like code.
  • Versioned reproducibility: Save and restore experiment state, and track changes to only execute what's new.
  • Distributed experiments: Organize locally and choose what to share, reusing your existing repo setup.

Experiment versioning treats experiments as code. It saves all metrics, hyperparameters, and artifact information in text files that can be versioned by Git, which becomes a store for experiment meta-information. The article above shows how, with the DVC tool, you can push experiments just like Git branches, giving you the flexibility to share only the experiments you choose.

πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/cmstrump
πŸ“…︎ Dec 09 2021
🚨︎ report
[R] Kwai, Kuaishou & ETH ZΓΌrich Propose PERSIA, a Distributed Training System That Supports Deep Learning-Based Recommenders of up to 100 Trillion Parameters

A research team from Kwai Inc., Kuaishou Technology and ETH ZΓΌrich has built PERSIA, an efficient distributed training system that leverages a novel hybrid training algorithm to ensure both training efficiency and accuracy for extremely large deep learning recommender systems of up to 100 trillion parameters.

Here is a quick read: Kwai, Kuaishou & ETH ZΓΌrich Propose PERSIA, a Distributed Training System That Supports Deep Learning-Based Recommenders of up to 100 Trillion Parameters.

The code is available on the project’s GitHub. The paper PERSIA: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters is on arXiv.

πŸ‘︎ 16
πŸ’¬︎
πŸ‘€︎ u/Yuqing7
πŸ“…︎ Nov 26 2021
🚨︎ report
DVC 2.0 Introduces Versioning for Learning Experiments - Distributed Versioning vs. Centralized Tracking for ML Experiments with DVC and Git

ML experiments often get split between Git for code and experiment tracking tools for meta-information, because Git can't manage or compare all that experiment meta-information, even though it is still the better tool for code.

The following guide explains how to use DVC for ML experiment versioning, which combines experiment tracking and version control: Don't Just Track Your ML Experiments, Version Them. Instead of managing these separately, you keep everything in one place and get the benefits of both, like:

  • Experiments as code: Track meta-information in the repository and version it like code.
  • Versioned reproducibility: Save and restore experiment state, and track changes to only execute what's new.
  • Distributed experiments: Organize locally and choose what to share, reusing your existing repo setup.

Experiment versioning treats experiments as code. It saves all metrics, hyperparameters, and artifact information in text files that can be versioned by Git, which becomes a store for experiment meta-information. The article above shows how, with the DVC tool, you can push experiments just like Git branches, giving you the flexibility to share only the experiments you choose.

πŸ‘︎ 2
πŸ’¬︎
πŸ“…︎ Dec 11 2021
🚨︎ report
Distributed Machine Learning Experiments Versioning Instead of Centralized Experiment Tracking - Managing ML Experiments as Code with Git and DVC

ML experiment versioning combines experiment tracking and version control, keeping everything in one place so you get the benefits of both: Don't Just Track Your ML Experiments, Version Them

  • Experiments as code: Track meta-information in the repository and version it like code.
  • Versioned reproducibility: Save and restore experiment state, and track changes to only execute what's new.
  • Distributed experiments: Organize locally and choose what to share, reusing your existing repo setup.

Experiment versioning treats experiments as code. It saves all metrics, hyperparameters, and artifact information in text files that can be versioned by Git, which becomes a store for experiment meta-information. The article above shows how, with the DVC tool, you can push experiments just like Git branches, giving you the flexibility to share only the experiments you choose.

πŸ‘︎ 2
πŸ’¬︎
πŸ“…︎ Dec 07 2021
🚨︎ report
How To Manage Machine Learning Experiments as Code with Git and DVC - Distributed Versioning vs. Centralized Tracking

ML experiment versioning combines experiment tracking and version control, keeping everything in one place so you get the benefits of both: Don't Just Track Your ML Experiments, Version Them

  • Experiments as code: Track meta-information in the repository and version it like code.
  • Versioned reproducibility: Save and restore experiment state, and track changes to only execute what's new.
  • Distributed experiments: Organize locally and choose what to share, reusing your existing repo setup.

Experiment versioning treats experiments as code. It saves all metrics, hyperparameters, and artifact information in text files that can be versioned by Git, which becomes a store for experiment meta-information. The article above shows how, with the DVC tool, you can push experiments just like Git branches, giving you the flexibility to share only the experiments you choose.

πŸ‘︎ 2
πŸ’¬︎
πŸ‘€︎ u/cmstrump
πŸ“…︎ Dec 09 2021
🚨︎ report
[R] Kwai, Kuaishou & ETH ZΓΌrich Propose PERSIA, a Distributed Training System That Supports Deep Learning-Based Recommenders of up to 100 Trillion Parameters

A research team from Kwai Inc., Kuaishou Technology and ETH ZΓΌrich has built PERSIA, an efficient distributed training system that leverages a novel hybrid training algorithm to ensure both training efficiency and accuracy for extremely large deep learning recommender systems of up to 100 trillion parameters.

Here is a quick read: Kwai, Kuaishou & ETH ZΓΌrich Propose PERSIA, a Distributed Training System That Supports Deep Learning-Based Recommenders of up to 100 Trillion Parameters.

The code is available on the project’s GitHub. The paper PERSIA: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters is on arXiv.

πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/Yuqing7
πŸ“…︎ Nov 26 2021
🚨︎ report
OneFlow: Redesign the Distributed Deep Learning Framework from Scratch [P]

Deep learning frameworks such as TensorFlow and PyTorch provide a productive interface for expressing and training a DNN model on a single device or using data parallelism. Still, they may not be flexible or efficient enough in training emerging large models on distributed devices, which require more sophisticated parallelism beyond data parallelism. Plugins or wrappers have been developed to strengthen these frameworks for model or pipeline parallelism, but they complicate the usage and implementation of distributed deep learning. Paper: https://arxiv.org/pdf/2110.15032.pdf; Code: https://github.com/Oneflow-Inc/oneflow

Aiming at a simple, neat redesign of distributed deep learning frameworks for various parallelism paradigms, we present OneFlow, a novel distributed training framework based on an SBP (split, broadcast and partial-value) abstraction and the actor model. SBP enables much easier programming of data parallelism and model parallelism than existing frameworks, and the actor model provides a succinct runtime mechanism to manage the complex dependencies imposed by resource constraints, data movement and computation in distributed deep learning.

We demonstrate the general applicability and efficiency of OneFlow for training various large DNN models with case studies and extensive experiments. The results show that OneFlow outperforms many well-known customized libraries built on top of the state-of-the-art frameworks.

πŸ‘︎ 4
πŸ’¬︎
πŸ‘€︎ u/Just0by
πŸ“…︎ Oct 29 2021
🚨︎ report
OneFlow: Redesign the Distributed Deep Learning Framework from Scratch

Deep learning frameworks such as TensorFlow and PyTorch provide a productive interface for expressing and training a DNN model on a single device or using data parallelism. Still, they may not be flexible or efficient enough in training emerging large models on distributed devices, which require more sophisticated parallelism beyond data parallelism. Plugins or wrappers have been developed to strengthen these frameworks for model or pipeline parallelism, but they complicate the usage and implementation of distributed deep learning. Paper: https://arxiv.org/pdf/2110.15032.pdf; Code: https://github.com/Oneflow-Inc/oneflow

Aiming at a simple, neat redesign of distributed deep learning frameworks for various parallelism paradigms, we present OneFlow, a novel distributed training framework based on an SBP (split, broadcast and partial-value) abstraction and the actor model. SBP enables much easier programming of data parallelism and model parallelism than existing frameworks, and the actor model provides a succinct runtime mechanism to manage the complex dependencies imposed by resource constraints, data movement and computation in distributed deep learning.

We demonstrate the general applicability and efficiency of OneFlow for training various large DNN models with case studies and extensive experiments. The results show that OneFlow outperforms many well-known customized libraries built on top of the state-of-the-art frameworks.

πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/Just0by
πŸ“…︎ Oct 30 2021
🚨︎ report
The distributed deep learning framework OneFlow v0.5.0RC came out! [Project]

We just launched OneFlow v0.5.0RC. With this release, OneFlow has the same API as PyTorch in eager mode, plus a more powerful and friendly API for distributed training. The consistent tensor enables model parallelism and pipeline parallelism without manual programming!

Changelog link: https://github.com/Oneflow-Inc/oneflow/releases/tag/v0.5rc1

Highlights

  • First-class support for eager execution. The deprecated APIs are moved to oneflow.compatible.single_client
  • Drop-in replacement of import torch for existing PyTorch projects. You can test it by interchanging import oneflow as torch and import torch as flow
  • nn.Module for eager execution
  • nn.Graph for lazy execution
  • DDP for data parallel
πŸ‘︎ 5
πŸ’¬︎
πŸ‘€︎ u/Just0by
πŸ“…︎ Sep 19 2021
🚨︎ report
