4254 - Distributed data systems
June 29, 2020 - June 30, 2020
9:00 am - 5:00 pm
About the course
The amount of data generated globally grows at an exponential rate, doubling every two years, and the data volumes that organizations must process have grown rapidly as well. Whether the task is identifying fraudulent activity across billions of transaction records or analyzing how millions of customers move through your website to increase conversions, traditional data systems cannot handle data at this scale. Distributed data systems, which spread storage and processing across multiple machines, can achieve remarkably high performance and are key for organizations dealing with ever-growing data volumes. Correctly configured, they deliver results far faster than traditional systems can.
For whom
This course is designed specifically for Data Scientists and Data Engineers. Many of the skills covered build on knowledge from the web scraping pre-work accompanying this badge and from the Data Models and Manipulation (4204) badge. To keep up with this course, participants must have experience interacting with APIs and expert programming skills in SQL and Python.
What you’ll learn
Principles of distributed data systems
- Able to explain how distributed storage and distributed processing can strengthen each other
- Able to explain the concepts of partitioning and multi-node processing
- Able to identify which distributed data system to use for your use case
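To make the idea of partitioning and multi-node processing concrete, here is a minimal sketch in plain Python. It is an illustration only, not course material: the function names (`partition_of`, `partition_records`, `count_per_key`, `distributed_count`) are hypothetical, and the "nodes" are simulated as lists in a single process rather than real machines.

```python
from collections import Counter
from zlib import crc32


def partition_of(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a stable hash (CRC32)."""
    return crc32(key.encode("utf-8")) % num_partitions


def partition_records(records, num_partitions):
    """Split (key, value) records so all records with one key share a partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_of(key, num_partitions)].append((key, value))
    return partitions


def count_per_key(partition):
    """Work a single node could do on its own partition: count records per key."""
    return Counter(key for key, _ in partition)


def distributed_count(records, num_partitions=4):
    """Count records per key by combining the partial results of every partition."""
    partials = [count_per_key(p) for p in partition_records(records, num_partitions)]
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

Because every record with a given key lands in the same partition, each partition can be processed independently and the partial counts simply added together. Frameworks such as Apache Spark apply the same principle across real machines.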
Pitfalls of distributed systems
- Able to identify whether a distributed system scales optimally
- Able to design queries for optimal parallelization of processing jobs
- Able to configure distributed data systems to balance performance against cost
Build a distributed data system
- Able to implement a distributed data system using Apache Spark
Theory and practical use
All trainings in the GAIn portfolio combine high-quality, standardized training material with theory sessions from experts and hands-on exercises in which you apply the material directly to real-life cases. Each training is developed by top-of-the-field practitioners, so it is full of industry examples, practical challenges, and know-how that fuel interactive discussions during the training. We believe this multi-level approach creates the ideal learning environment for participants to thrive.