Distributed Data Systems ★★★★ Master Level
Distributed data systems can achieve remarkably high performance and are key for organizations dealing with ever-growing data volumes. When correctly configured, they can compute results much faster than traditional data systems.
Course Badge
Language: English
Duration: 2 days
Time: 9:00-17:00
Certification: Yes
Lunch: Included
Recommended Level: Master
Upcoming courses
Currently there are no scheduled dates for this course. To be notified about upcoming dates, please choose "Reserve a seat".

*If you are a group of 5 or more, we are happy to arrange a training date that suits you best. To request one, please choose the "Reserve a seat" option.

About the course

The amount of data generated globally grows at an exponential rate, doubling every 2 years. Accordingly, the data volumes that organizations must process have also grown rapidly. Whether the task is identifying fraudulent activity across billions of transaction records or analyzing how millions of customers move through your website to increase conversions, these data volumes are beyond what traditional data systems can handle. Distributed data systems can achieve remarkably high performance and are key for organizations dealing with ever-growing data volumes. When correctly configured, they can compute results much faster than traditional data systems.

For whom

This course is designed specifically for Data Scientists and Data Engineers. Many of the skills covered build on prior knowledge outlined in the web scraping pre-work accompanying this badge and in the Data Models and Manipulation (4204) badge. Participants need experience working with APIs and expert-level programming skills in SQL and Python to keep up with this course.

What you’ll learn

Principles of distributed data systems
  • Able to explain how distributed storage and distributed processing can strengthen each other
  • Able to explain the concepts of partitioning and multi-node processing
  • Able to identify which distributed data system to use for your use case
Pitfalls of distributed systems
  • Able to identify if a distributed system scales in an optimal way
  • Able to design queries for optimal parallelization of processing jobs
  • Able to configure distributed data systems for optimized performance versus costs
Build a distributed data system
  • Able to implement a distributed data system using Apache Spark
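
To make the ideas of partitioning and multi-node processing from the list above more concrete, here is a minimal sketch in plain Python. It is not course material and not how Apache Spark is implemented: the function names and the sample records are hypothetical, and worker threads on one machine stand in for the separate nodes of a real cluster. Records are split by a stable hash of their key, each partition is aggregated independently, and the partial results are merged.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition_by_key(records, num_partitions):
    """Assign each (key, value) record to a partition using a stable hash of the key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


def sum_by_key(partition):
    """Aggregate a single partition without looking at any other partition."""
    totals = Counter()
    for key, value in partition:
        totals[key] += value
    return totals


def distributed_sum(records, num_partitions=4):
    """Sum values per key by processing the partitions in parallel."""
    partitions = partition_by_key(records, num_partitions)
    # Threads are an assumption standing in for cluster nodes.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_totals = list(pool.map(sum_by_key, partitions))
    # All records for a key land in the same partition, so the partial
    # results never overlap and can be merged with a plain union.
    merged = {}
    for totals in partial_totals:
        merged.update(totals)
    return merged


# Hypothetical sample data: (customer, amount) pairs.
records = [("alice", 10), ("bob", 5), ("alice", 7), ("carol", 3), ("bob", 1)]
result = distributed_sum(records)
```

Partitioning by key is what keeps the merge step cheap in this sketch: no two workers ever hold the same key, so no cross-partition reconciliation is needed. A skewed key distribution would overload one partition, which is one reason a system may not scale as well as expected.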

Theory and practical use

All trainings in the GAIn portfolio combine high-quality, standardized training material with theory sessions from experts and hands-on experience in which you apply the material directly to real-life cases. Each training is developed by top-of-the-field practitioners, so it is full of industry examples, practical challenges, and know-how that fuel the interactive discussions during training. We believe this multi-level approach creates the ideal learning environment for participants to thrive.