Compare commits

...

12 Commits

Author SHA1 Message Date
Kamran Ahmed
f6589f9a8c Merge branch 'master' into content/data-engineer 2025-08-28 14:59:38 +01:00
Javi Canales
5c287d825f clean typo in de roadmap 2025-08-27 09:47:14 +02:00
Javi Canales
933c568066 add 4 missing contents 2025-08-22 12:20:51 +02:00
Javi Canales
3b809d9f81 add last batch of content for DE roadmap. Ready to PR 2025-08-22 12:18:20 +02:00
Javi Canales
efd5d20089 new 30 contents for DE roadmap 2025-08-21 17:19:35 +02:00
Javi Canales
3a4514b9f1 add 30 new content for DE roadmap 2025-08-21 14:31:58 +02:00
Javi Canales
c9ee8b0ee0 new batch in DE roadmap with 25 contents 2025-08-20 11:48:00 +02:00
Javi Canales
0d5bce309e new batch of content from DE roadmap 2025-08-19 13:34:44 +02:00
Javi Canales
87a2d493e2 batch of new content for data engineer roadmap 2025-08-18 11:26:29 +02:00
Javi Canales
4c7daa6a5b add content to DE roadmap and fix some typos in content appearing in several roadmaps 2025-08-14 16:55:13 +02:00
Javi Canales
88ac6406f9 add content to data engineer roadmap 2025-08-14 14:16:38 +02:00
Kamran Ahmed
374a56eeff Add basic content 2025-08-13 15:34:04 +01:00
186 changed files with 1729 additions and 186 deletions

View File

@@ -1 +1,8 @@
# A/B Testing
# A/B Testing
A/B testing is a way to compare two versions of something to see which one works better. You split your audience into two groups: one sees version A, the other sees version B. You then measure which version gets better results, like more clicks, sales, or sign-ups. This helps you make decisions based on real data instead of guesses.
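As a rough illustration, here is a self-contained Python sketch of the mechanics, with made-up user IDs and click events: users are deterministically hashed into a variant, and conversion rates are compared per group.

```python
import hashlib

def assign_variant(user_id: str) -> str:
    """Deterministically split users into two groups by hashing their ID."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 2
    return "A" if bucket == 0 else "B"

# Hypothetical click log: (user_id, clicked)
events = [("u1", True), ("u2", False), ("u3", True), ("u4", False), ("u5", True)]

totals = {"A": 0, "B": 0}
clicks = {"A": 0, "B": 0}
for user_id, clicked in events:
    variant = assign_variant(user_id)
    totals[variant] += 1
    clicks[variant] += clicked

for variant in ("A", "B"):
    rate = clicks[variant] / totals[variant] if totals[variant] else 0.0
    print(f"Variant {variant}: {clicks[variant]}/{totals[variant]} conversions ({rate:.1%})")
```

In practice you would also run a statistical significance test before acting on the measured difference.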
Visit the following resources to learn more:
- [@article@A software engineer's guide to A/B testing](https://posthog.com/product-engineers/ab-testing-guide-for-engineers)
- [@video@A/B Testing for Beginners](https://www.youtube.com/watch?v=VpTlNRUcIDo)

View File

@@ -1 +1,8 @@
# Amazon EC2 (Compute)
# Amazon EC2 (Compute)
Amazon Elastic Compute Cloud (EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers. EC2's simple web service interface allows you to obtain and configure capacity with minimal friction. EC2 enables you to scale your compute capacity, develop and deploy applications faster, and run applications on AWS's reliable computing environment. You have control of your computing resources and can access various configurations of CPU, memory, storage, and networking capacity for your instances.
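As a hedged sketch of what "obtain and configure capacity" looks like in code, the snippet below launches and stops one instance with the `boto3` SDK; the region, AMI ID, and instance type are placeholders, and valid AWS credentials are assumed.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch a single small instance from a placeholder machine image.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder: use a real AMI for your region
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]
print("launched", instance_id)

# Stop the instance again when done.
ec2.stop_instances(InstanceIds=[instance_id])
```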
Visit the following resources to learn more:
- [@official@EC2 - User Guide](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html)
- [@video@Introduction to Amazon EC2](https://www.youtube.com/watch?v=eaicwmnSdCs)

View File

@@ -1 +1,7 @@
# Amazon RDS (Database)
# Amazon RDS (Database)
Amazon RDS (Relational Database Service) is a web service from Amazon Web Services. It's designed to simplify the setup, operation, and scaling of relational databases in the cloud. This service provides cost-efficient, resizable capacity for an industry-standard relational database and manages common database administration tasks. RDS supports six database engines: Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server. These engines give you the ability to run instances ranging from 5GB to 6TB of memory, accommodating your specific use case. It also ensures the database is up-to-date with the latest patches, automatically backs up your data and offers encryption at rest and in transit.
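Once an instance is provisioned, applications connect to it like any other database. A minimal sketch using `psycopg2`, assuming a PostgreSQL engine and placeholder endpoint and credentials:

```python
import psycopg2

# The host, database name, and credentials below are placeholders.
conn = psycopg2.connect(
    host="mydb.abc123xyz.us-east-1.rds.amazonaws.com",
    port=5432,
    dbname="appdb",
    user="app_user",
    password="change-me",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])  # e.g. "PostgreSQL 16.x ..."
conn.close()
```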
Visit the following resources to learn more:
- [@official@Amazon RDS](https://aws.amazon.com/rds/)

View File

@@ -1 +1,7 @@
# Amazon RDS (Database)
# Amazon RDS (Database)
Amazon RDS (Relational Database Service) is a web service from Amazon Web Services. It's designed to simplify the setup, operation, and scaling of relational databases in the cloud. This service provides cost-efficient, resizable capacity for an industry-standard relational database and manages common database administration tasks. RDS supports six database engines: Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server. These engines give you the ability to run instances ranging from 5GB to 6TB of memory, accommodating your specific use case. It also ensures the database is up-to-date with the latest patches, automatically backs up your data and offers encryption at rest and in transit.
Visit the following resources to learn more:
- [@official@Amazon RDS](https://aws.amazon.com/rds/)

View File

@@ -1 +1,8 @@
# Amazon Redshift
# Amazon Redshift
Amazon Redshift is a cloud-based data warehouse service from Amazon that lets you store and analyze large amounts of data quickly. It's designed for running complex queries on huge datasets, so businesses can use it to turn raw data into useful reports and insights. You can load data into Redshift from many sources, and then use SQL to explore it, just like you would with a regular database — but it's optimized to handle much bigger data and run faster.
Visit the following resources to learn more:
- [@official@Amazon Redshift](https://aws.amazon.com/redshift/)
- [@video@Getting Started with Amazon Redshift - AWS Online Tech Talks](https://www.youtube.com/watch?v=dfo4J5ZhlKI)

View File

@@ -1 +1,7 @@
# Apache Airflow
# Apache Airflow
Apache Airflow is an open-source tool that helps you schedule, organize, and monitor workflows. Think of it like a to-do list for your data tasks, but smarter — you can set tasks to run in a specific order, track their progress, and see what happens if something fails. It's often used for automating data pipelines so that data moves, gets processed, and is ready for use without manual work.
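A minimal sketch of such a pipeline, assuming Airflow 2.x; the DAG ID, task names, and schedule are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from a source system")

def load():
    print("writing data to the warehouse")

# A daily two-step pipeline: extract runs first, then load.
with DAG(
    dag_id="example_etl",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # declare the dependency (run order)
```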
Visit the following resources to learn more:
- [@official@Apache Airflow](https://airflow.apache.org/)

View File

@@ -1 +1,7 @@
# Apache Hadoop YARN
# Apache Hadoop YARN
Apache Hadoop YARN (Yet Another Resource Negotiator) is the part of Hadoop that manages resources and runs jobs on a cluster. It has a ResourceManager that controls all cluster resources and an ApplicationMaster for each job that schedules and runs tasks. YARN lets different tools like MapReduce and Spark share the same cluster, making it more efficient, flexible, and reliable.
Visit the following resources to learn more:
- [@video@Hadoop Yarn Tutorial](https://www.youtube.com/watch?v=6bIF9VwRwE0)

View File

@@ -1 +1,12 @@
# Apache Kafka
# Apache Kafka
Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation. It is written in Scala and Java and operates based on a message queue, designed to handle real-time data feeds. Kafka functions as a kind of message broker service in between the data producers and the consumers, facilitating efficient transmission of data. It can be viewed as a durable message broker where applications can process and reprocess streamed data. Kafka is a highly scalable and fault-tolerant system which ensures data delivery without loss.
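To make the producer/consumer model concrete, here is a hedged sketch using the third-party `kafka-python` client; it assumes a broker on `localhost:9092` and a hypothetical `clicks` topic.

```python
from kafka import KafkaProducer, KafkaConsumer

# Produce a few events to the topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("clicks", value=f"click-{i}".encode())
producer.flush()  # block until the broker has acknowledged the sends

# Read them back from the beginning of the topic.
consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5s of silence
)
for message in consumer:
    print(message.topic, message.offset, message.value.decode())
```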
Visit the following resources to learn more:
- [@official@Apache Kafka](https://kafka.apache.org/quickstart)
- [@official@Kafka Streams - Confluent](https://docs.confluent.io/platform/current/streams/concepts.html)
- [@official@Apache Kafka Streams](https://kafka.apache.org/documentation/streams/)
- [@video@Apache Kafka Fundamentals](https://www.youtube.com/watch?v=B5j3uNBH8X4)
- [@video@Kafka in 100 Seconds](https://www.youtube.com/watch?v=uvb00oaa3k8)
- [@feed@Explore top posts about Kafka](https://app.daily.dev/tags/kafka?ref=roadmapsh)

View File

@@ -1 +1,9 @@
# Apache Spark
# Apache Spark
Apache Spark is an open-source distributed computing system designed for big data processing and analytics. It offers a unified interface for programming entire clusters, enabling efficient handling of large-scale data with built-in support for data parallelism and fault tolerance. Spark excels in processing tasks like batch processing, real-time data streaming, machine learning, and graph processing. It's known for its speed, ease of use, and ability to process data in-memory, significantly outperforming traditional MapReduce systems. Spark is widely used in big data ecosystems for its scalability and versatility across various data processing tasks.
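A minimal PySpark sketch of that unified interface, with a tiny in-memory DataFrame standing in for a large distributed dataset:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

# Illustrative data; the same code scales to billions of rows on a cluster.
df = spark.createDataFrame(
    [("electronics", 120.0), ("books", 20.0), ("electronics", 80.0)],
    ["category", "price"],
)

df.groupBy("category").agg(F.avg("price").alias("avg_price")).show()

spark.stop()
```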
Visit the following resources to learn more:
- [@official@Apache Spark](https://spark.apache.org/documentation.html)
- [@article@Spark By Examples](https://sparkbyexamples.com)
- [@feed@Explore top posts about Apache Spark](https://app.daily.dev/tags/spark?ref=roadmapsh)

View File

@@ -1 +1,8 @@
# APIs
# APIs and Data Collection
Application Programming Interfaces, better known as APIs, play a fundamental role in the work of data engineers, particularly in the process of data collection. APIs are sets of protocols, routines, and tools that enable different software applications to communicate with each other. An API allows developers to interact with a service or platform through a defined set of rules and endpoints, enabling data exchange and functionality use without needing to understand the underlying code. In data engineering, APIs are used extensively to collect, exchange, and manipulate data from different sources in a secure and efficient manner.
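For example, collecting records from a REST API usually comes down to HTTP requests against documented endpoints. A hedged sketch with the `requests` library; the URL and query parameters are invented for illustration:

```python
import requests

# Fetch one page of records from a hypothetical paginated endpoint.
url = "https://api.example.com/v1/orders"
response = requests.get(url, params={"page": 1, "per_page": 100}, timeout=10)
response.raise_for_status()  # fail loudly on HTTP errors

for order in response.json():
    print(order["id"], order["total"])
```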
Visit the following resources to learn more:
- [@article@What is an API?](https://aws.amazon.com/what-is/api/)
- [@article@A Beginner's Guide to APIs](https://www.postman.com/what-is-an-api/)

View File

@@ -1 +1,10 @@
# ArgoCD
# ArgoCD
Argo CD is a continuous delivery tool for Kubernetes that is based on the GitOps methodology. It is used to automate the deployment and management of cloud-native applications by continuously synchronizing the desired application state with the actual application state in the production environment. In an Argo CD workflow, changes to the application are made by committing code or configuration changes to a Git repository. Argo CD monitors the repository and automatically deploys the changes to the production environment using a continuous delivery pipeline. The pipeline is triggered by changes to the Git repository and is responsible for building, testing, and deploying the changes to the production environment. Argo CD is designed to be a simple and efficient way to manage cloud-native applications, as it allows developers to make changes to the system using familiar tools and processes and it provides a clear and auditable history of all changes to the system. It is often used in conjunction with tools such as Helm to automate the deployment and management of cloud-native applications.
Visit the following resources to learn more:
- [@official@Argo CD - Argo Project](https://argo-cd.readthedocs.io/en/stable/)
- [@video@ArgoCD Tutorial for Beginners](https://www.youtube.com/watch?v=MeU5_k9ssrs)
- [@video@What is ArgoCD](https://www.youtube.com/watch?v=p-kAqxuJNik)
- [@feed@Explore top posts about ArgoCD](https://app.daily.dev/tags/argocd?ref=roadmapsh)

View File

@@ -1 +1,10 @@
# Async vs Sync Communication
# Async vs Sync Communication
Synchronous and asynchronous data refer to different approaches in data transmission and processing. **Synchronous** ingestion is a process where the system waits for a response from the data source before proceeding. In contrast, **asynchronous** ingestion is a process where data is ingested without waiting for a response from the data source. Normally, data is queued in a buffer and sent in batches for efficiency.
Each approach has its benefits and drawbacks, and the choice depends on the specific requirements of the data ingestion process and the business needs.
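The difference is easy to see in code. A small Python sketch that simulates three one-second source calls, first synchronously and then asynchronously with the standard library's `asyncio`:

```python
import asyncio
import time

def fetch_sync(source: str) -> str:
    time.sleep(1)  # block while waiting for the data source to respond
    return f"data from {source}"

async def fetch_async(source: str) -> str:
    await asyncio.sleep(1)  # the event loop can do other work meanwhile
    return f"data from {source}"

sources = ["orders", "users", "events"]

# Synchronous: each call waits for the previous one (~3s total).
start = time.perf_counter()
sync_results = [fetch_sync(s) for s in sources]
print(f"sync:  {time.perf_counter() - start:.1f}s")

# Asynchronous: the three waits overlap (~1s total).
async def main():
    return await asyncio.gather(*(fetch_async(s) for s in sources))

start = time.perf_counter()
async_results = asyncio.run(main())
print(f"async: {time.perf_counter() - start:.1f}s")
```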
Visit the following resources to learn more:
- [@article@Synchronous And Asynchronous Data Transmission: The Differences And How to Use Them](https://www.computer.org/publications/tech-news/trends/synchronous-asynchronous-data-transmission)
- [@article@Synchronous vs Asynchronous Communication: What's the Difference?](https://www.getguru.com/reference/synchronous-vs-asynchronous-communication)

View File

@@ -1 +1,9 @@
# Aurora DB
# Aurora DB
Amazon Aurora (Aurora) is a fully managed relational database engine that's compatible with MySQL and PostgreSQL. Aurora includes a high-performance storage subsystem. Its MySQL- and PostgreSQL-compatible database engines are customized to take advantage of that fast distributed storage. The underlying storage grows automatically as needed. Aurora also automates and standardizes database clustering and replication, which are typically among the most challenging aspects of database configuration and administration.
Visit the following resources to learn more:
- [@official@Amazon Aurora](https://aws.amazon.com/rds/aurora/)
- [@article@Amazon Aurora: What It Is, How It Works, and How to Get Started](https://www.datacamp.com/tutorial/amazon-aurora)

View File

@@ -1 +1,8 @@
# Authentication vs Authorization
# Authentication vs Authorization
Authentication and authorization are popular terms in modern computer systems that often confuse people. **Authentication** is the process of confirming the identity of a user or a device (i.e., an entity). During the authentication process, an entity usually relies on some proof to authenticate itself, i.e. an authentication factor. In contrast to authentication, **authorization** refers to the process of verifying what resources entities (users or devices) can access, or what actions they can perform, i.e., their access rights.
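A toy Python sketch that separates the two checks; the tokens, users, and roles are made up:

```python
USERS = {"token-123": {"name": "alice", "role": "analyst"}}
PERMISSIONS = {
    "analyst": {"read_reports"},
    "admin": {"read_reports", "delete_reports"},
}

def authenticate(token: str) -> dict:
    """Authentication: who are you? Verify identity from a credential."""
    user = USERS.get(token)
    if user is None:
        raise PermissionError("authentication failed: unknown token")
    return user

def authorize(user: dict, action: str) -> None:
    """Authorization: what may you do? Check the user's access rights."""
    if action not in PERMISSIONS.get(user["role"], set()):
        raise PermissionError(f"{user['role']} cannot {action}")

user = authenticate("token-123")   # passes: the token is known
authorize(user, "read_reports")    # passes: analysts may read reports
authorize(user, "delete_reports")  # raises: analysts may not delete
```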
Visit the following resources to learn more:
- [@roadmap.sh@Basic Authentication](https://roadmap.sh/guides/basic-authentication)
- [@article@What is Authentication vs Authorization?](https://auth0.com/intro-to-iam/authentication-vs-authorization)

View File

@@ -1 +1,11 @@
# AWS CDK
# AWS CDK
The AWS Cloud Development Kit (AWS CDK) is an open-source software development framework used to provision cloud infrastructure resources in a safe, repeatable manner through AWS CloudFormation. AWS CDK offers the flexibility to write infrastructure as code in popular languages like Python, Java, Go, and C#.
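A minimal sketch of infrastructure as code with the CDK's Python bindings (assuming CDK v2, i.e. the `aws-cdk-lib` package); the stack and bucket names are illustrative:

```python
import aws_cdk as cdk
from aws_cdk import aws_s3 as s3
from constructs import Construct

class DataLakeStack(cdk.Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # One line of Python becomes a CloudFormation-managed S3 bucket.
        s3.Bucket(self, "RawDataBucket", versioned=True)

app = cdk.App()
DataLakeStack(app, "DataLakeStack")
app.synth()  # emit the CloudFormation template (normally run via `cdk deploy`)
```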
Visit the following resources to learn more:
- [@official@AWS CDK](https://aws.amazon.com/cdk/)
- [@official@AWS CDK Documentation](https://docs.aws.amazon.com/cdk/index.html)
- [@course@AWS CDK Crash Course for Beginners](https://www.youtube.com/watch?v=D4Asp5g4fp8)
- [@opensource@AWS CDK Examples](https://github.com/aws-samples/aws-cdk-examples)
- [@feed@Explore top posts about AWS](https://app.daily.dev/tags/aws?ref=roadmapsh)

View File

@@ -1 +1,8 @@
# AWS EKS
# EKS
Amazon Elastic Kubernetes Service (EKS) is a managed service that simplifies the deployment, management, and scaling of containerized applications using Kubernetes, an open-source container orchestration platform. EKS manages the Kubernetes control plane for the user, making it easy to run Kubernetes applications without the operational overhead of maintaining the Kubernetes control plane. With EKS, you can leverage AWS services such as Auto Scaling Groups, Elastic Load Balancer, and Route 53 for resilient and scalable application infrastructure. Additionally, EKS supports the use of Spot and On-Demand Instances, and includes integrations with AWS App Mesh and AWS Fargate for serverless compute.
Visit the following resources to learn more:
- [@official@Amazon Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/)
- [@official@Concepts of Amazon EKS](https://docs.aws.amazon.com/eks/)

View File

@@ -1 +1,10 @@
# AWS SNS
# AWS SNS
Amazon Simple Notification Service (Amazon SNS) is a web service that makes it easy to set up, operate, and send notifications from the cloud. It provides developers with a highly scalable, flexible, and cost-effective capability to publish messages from an application and immediately deliver them to subscribers or other applications. It is designed to make web-scale computing easier for developers. Amazon SNS follows the “publish-subscribe” (pub-sub) messaging paradigm, with notifications being delivered to clients using a “push” mechanism that eliminates the need to periodically check or “poll” for new information and updates. With simple APIs requiring minimal up-front development effort, no maintenance or management overhead and pay-as-you-go pricing, Amazon SNS gives developers an easy mechanism to incorporate a powerful notification system with their applications.
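Publishing to a topic is a single API call. A hedged `boto3` sketch; the region and topic ARN are placeholders:

```python
import boto3

sns = boto3.client("sns", region_name="us-east-1")

sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:order-events",  # placeholder
    Subject="OrderCreated",
    Message='{"order_id": 42, "total": 99.5}',
)
# Every subscriber of the topic (SQS queues, Lambda functions, email, ...)
# receives its own copy of the message, pushed by SNS.
```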
Visit the following resources to learn more:
- [@official@Amazon Simple Notification Service (SNS)](http://aws.amazon.com/sns/)
- [@official@Send Fanout Event Notifications](https://aws.amazon.com/getting-started/hands-on/send-fanout-event-notifications/)
- [@article@What is Pub/Sub Messaging?](https://aws.amazon.com/what-is/pub-sub-messaging/)

View File

@@ -1 +1,10 @@
# AWS SQS
# AWS SQS
Amazon Simple Queue Service (Amazon SQS) offers a secure, durable, and available hosted queue that lets you integrate and decouple distributed software systems and components. Amazon SQS offers common constructs such as dead-letter queues and cost allocation tags. It provides a generic web services API that you can access using any programming language that the AWS SDK supports.
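A hedged `boto3` sketch of the send/receive/delete cycle; the region and queue URL are placeholders:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/ingest-queue"  # placeholder

# Producer side: enqueue a message.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"file": "events.csv"}')

# Consumer side: long-poll for messages, then delete each one once processed.
resp = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=5
)
for msg in resp.get("Messages", []):
    print("processing:", msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```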
Visit the following resources to learn more:
- [@official@Amazon Simple Queue Service](https://aws.amazon.com/sqs/)
- [@official@What is Amazon Simple Queue Service?](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/welcome.html)
- [@article@Amazon Simple Queue Service (SQS): A Comprehensive Tutorial](https://www.datacamp.com/tutorial/amazon-sqs)

View File

@@ -1 +1,9 @@
# Azure Blob Storage
# Azure Blob Storage
Azure Blob Storage is Microsoft's object storage solution for the cloud. “Blob” stands for Binary Large Object, a term used to describe storage for unstructured data like text, images, and video. It offers flexible, pay-per-use storage: depending on the access speed you need for your data, you can choose from various storage tiers (hot, cool, and archive). Being cloud-based, it is scalable, secure, and easy to manage.
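Uploading and listing blobs takes a few lines with the `azure-storage-blob` SDK. A sketch assuming a placeholder connection string and an existing `raw-data` container:

```python
from azure.storage.blob import BlobServiceClient

# Placeholder: copy the real connection string from the Azure portal.
conn_str = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
service = BlobServiceClient.from_connection_string(conn_str)

# Upload a local file as a blob.
blob = service.get_blob_client(container="raw-data", blob="events/2025-08-28.json")
with open("events.json", "rb") as f:
    blob.upload_blob(f, overwrite=True)

# List what the container now holds.
container = service.get_container_client("raw-data")
for item in container.list_blobs():
    print(item.name, item.size)
```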
Visit the following resources to learn more:
- [@official@Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs)
- [@official@Introduction to Azure Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction)
- [@video@A Beginner's Guide to Azure Blob Storage](https://www.youtube.com/watch?v=ah1XqItWkuc&t=300s)

View File

@@ -1 +1,10 @@
# Azure SQL Database
# Azure SQL Database
Azure SQL Database is a fully managed Platform as a Service (PaaS) offering. It abstracts the underlying infrastructure, enabling developers to focus on building and deploying applications without worrying about database maintenance tasks.
Visit the following resources to learn more:
- [@official@Azure SQL Database](https://azure.microsoft.com/en-us/products/azure-sql/database)
- [@official@What is Azure SQL Database?](https://learn.microsoft.com/en-us/azure/azure-sql/database/sql-database-paas-overview?view=azuresql)
- [@article@Azure SQL Database: Step-by-Step Setup and Management](https://www.datacamp.com/tutorial/azure-sql-database)
- [@video@Azure SQL for Beginners](https://www.youtube.com/playlist?list=PLlrxD0HtieHi5c9-i_Dnxw9vxBY-TqaeN)

View File

@@ -1 +1,9 @@
# Azure Virtual Machines
# Azure Virtual Machines
Azure Virtual Machines (VMs) enable virtualization without requiring hardware investments. They provide customizable environments for development, testing, and cloud applications so you can run different operating systems like Ubuntu on a Windows host based on your needs. One of the key advantages of Azure VMs is the pay-as-you-go pricing model. It allows you to scale resources up or down as needed, ensuring cost efficiency without wasting resources.
Visit the following resources to learn more:
- [@official@Azure Virtual Machines](https://azure.microsoft.com/en-us/products/virtual-machines)
- [@official@Virtual Machines in Azure](https://learn.microsoft.com/en-us/azure/virtual-machines/overview)
- [@video@Virtual Machines in Azure | Beginner's Guide](https://www.youtube.com/watch?v=_abaWXoQFZU)

View File

@@ -1 +1,9 @@
# Batch
# Batch
Batch processing is a method in which large volumes of collected data are processed in chunks or batches. This approach is especially effective for resource-intensive jobs, repetitive tasks, and managing extensive datasets where real-time processing isn't required. It is ideal for applications like data warehousing, ETL (Extract, Transform, Load), and large-scale reporting. Data batch processing is mainly automated, requiring minimal human interaction once the process is set up. Tasks are predefined, and the system executes them according to a scheduled timeline, typically during off-peak hours when computing resources are readily available.
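A simple Python illustration of the idea, processing a (hypothetical) large CSV in fixed-size chunks with pandas instead of loading it all at once:

```python
import pandas as pd

total = 0.0
rows = 0
# Read 100,000 rows at a time; only one chunk is in memory at any moment.
for chunk in pd.read_csv("sales.csv", chunksize=100_000):
    total += chunk["amount"].sum()
    rows += len(chunk)

print(f"processed {rows} rows in batches; total sales = {total:.2f}")
```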
Visit the following resources to learn more:
- [@article@What is Batch Processing?](https://aws.amazon.com/what-is/batch-processing/)
- [@article@Batch And Streaming Demystified For Unification](https://towardsdatascience.com/batch-and-streaming-demystified-for-unification-dee0b48f921d/)

View File

@@ -1 +1,15 @@
# Best Practices
# Best Practices
1. **Ensure Reliability.** A robust messaging system must guarantee that messages aren't lost, even during node failures or network issues. This means using acknowledgments, replication across multiple brokers, and durable storage on disk. These measures ensure that producers and consumers can recover seamlessly without data loss when something goes wrong.
2. **Design for Scalability.** Scalability should be baked in from the start. Partition topics strategically to distribute load across brokers and consumer groups, enabling horizontal scaling.
3. **Maintain Message Ordering.** For systems that depend on message sequence, ensure ordering within partitions and design producers to consistently route related messages to the same partition.
4. **Secure Communication.** Messaging queues often carry sensitive data, so encrypt messages both in transit and at rest. Implement authentication techniques to ensure only trusted clients can publish or consume, and enforce authorization rules to limit access to specific topics or operations.
5. **Monitor & Alert.** Continuous visibility into your messaging system is essential. Track metrics such as message lag, throughput, consumer group health, and broker disk usage. Set alerts for abnormal patterns, like growing lag or dropped connections, so you can respond before they affect downstream systems.
Visit the following resources to learn more:
- [@article@Best Practices for Message Queue Architecture](https://abhishek-patel.medium.com/best-practices-for-message-queue-architecture-f69d47e3565)

View File

@@ -1 +1,12 @@
# Big Data Tools
# Big Data Tools
Big data tools are specialized software and platforms designed to handle the massive volume, velocity, and variety of data that traditional data processing tools cannot effectively manage. These tools provide the infrastructure, frameworks, and capabilities to process, analyze, and extract meaningful knowledge from vast datasets. They are essential for modern data-driven organizations seeking to gain insights, make informed decisions, and achieve a competitive advantage.
Hadoop and Spark are two of the most prominent frameworks in big data, and they handle the processing of large-scale data in very different ways. While Hadoop can be credited with democratizing the distributed computing paradigm through a robust storage system called HDFS and a computational model called MapReduce, Spark is changing the game with its in-memory architecture and flexible programming model.
Visit the following resources to learn more:
- [@article@What is Big Data?](https://cloud.google.com/learn/what-is-big-data?hl=en)
- [@article@Hadoop vs Spark: Which Big Data Framework Is Right For You?](https://www.datacamp.com/blog/hadoop-vs-spark)
- [@video@Introduction to Big Data with Spark and Hadoop](http://youtube.com/watch?v=vHlwg4ciCsI&t=80s&ab_channel=freeCodeAcademy)

View File

@@ -1 +1,8 @@
# BigTable
# BigTable
Bigtable is a high-performance, scalable database that excels at capturing, processing, and analyzing data in real-time. It aggregates data as it's written, providing immediate insights into user behavior, A/B testing results, and engagement metrics. This real-time capability also fuels AI/ML models for interactive applications. Bigtable integrates seamlessly with both Dataflow, enriching streaming pipelines with low-latency lookups, and BigQuery, enabling real-time serving of analytics in user-facing applications and ad-hoc querying on the same data.
Visit the following resources to learn more:
- [@official@Bigtable: Fast, Flexible NoSQL](https://cloud.google.com/bigtable?hl=en#scale-your-latency-sensitive-applications-with-the-nosql-pioneer)
- [@article@Google Bigtable](https://www.techtarget.com/searchdatamanagement/definition/Google-BigTable)

View File

@@ -1 +1,11 @@
# Business Intelligence
# Business Intelligence
Business intelligence encompasses a set of techniques and technologies to transform raw data into meaningful insights that drive strategic decision-making within an organization. BI tools enable business users to access different types of data, historical and current, third-party and in-house, as well as semistructured data and unstructured data such as social media. Users can analyze this information to gain insights into how the business is performing and what it should do next.
BI platforms traditionally rely on data warehouses for their baseline information. The strength of a data warehouse is that it aggregates data from multiple data sources into one central system to support business data analytics and reporting. BI presents the results to the user in the form of reports, charts and maps, which might be displayed through a dashboard.
Visit the following resources to learn more:
- [@article@What is business intelligence (BI)?](https://www.ibm.com/think/topics/business-intelligence)
- [@article@Business intelligence: A complete overview](https://www.tableau.com/business-intelligence/what-is-business-intelligence)
- [@video@What is business intelligence?](https://www.youtube.com/watch?v=l98-BcB3UIE)

View File

@@ -1 +1,10 @@
# CAP Theorem
# CAP Theorem
The CAP Theorem, also known as Brewer's Theorem, is a fundamental principle in distributed database systems. It states that in a distributed system, it's impossible to simultaneously guarantee all three of the following properties: Consistency (all nodes see the same data at the same time), Availability (every request receives a response, without guarantee that it contains the most recent version of the data), and Partition tolerance (the system continues to operate despite network failures between nodes). According to the theorem, a distributed system can only strongly provide two of these three guarantees at any given time. This principle guides the design and architecture of distributed systems, influencing decisions on data consistency models, replication strategies, and failure handling. Understanding the CAP Theorem is crucial for designing robust, scalable distributed systems and for choosing appropriate database solutions for specific use cases in distributed computing environments.
Visit the following resources to learn more:
- [@article@What is CAP Theorem?](https://www.bmc.com/blogs/cap-theorem/)
- [@article@An Illustrated Proof of the CAP Theorem](https://mwhittaker.github.io/blog/an_illustrated_proof_of_the_cap_theorem/)
- [@article@CAP Theorem and its applications in NoSQL Databases](https://www.ibm.com/uk-en/cloud/learn/cap-theorem)
- [@video@What is CAP Theorem?](https://www.youtube.com/watch?v=_RbsFXWRZ10)

View File

@@ -1 +1,10 @@
# Cassandra
# Cassandra
Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle large amounts of structured data across multiple commodity servers. It provides high availability with no single point of failure, offering linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure. Cassandra uses a masterless ring architecture, where all nodes are equal, allowing for easy data distribution and replication. It supports flexible data models and can handle both unstructured and structured data. Cassandra excels in write-heavy environments and is particularly suitable for applications requiring high throughput and low latency. Its data model is based on wide column stores, offering a more complex structure than key-value stores. Widely used in big data applications, Cassandra is known for its ability to handle massive datasets while maintaining performance and reliability.
Visit the following resources to learn more:
- [@official@Apache Cassandra](https://cassandra.apache.org/_/index.html)
- [@article@Cassandra - Quick Guide](https://www.tutorialspoint.com/cassandra/cassandra_quick_guide.htm)
- [@video@Apache Cassandra - Course for Beginners](https://www.youtube.com/watch?v=J-cSy5MeMOA)
- [@feed@Explore top posts about Backend Development](https://app.daily.dev/tags/backend?ref=roadmapsh)

View File

@@ -1 +1,10 @@
# Census
# Census
Census is a reverse ETL platform that synchronizes data from a data warehouse to various business applications and SaaS apps like Salesforce and Hubspot. It's a crucial part of the modern data stack, enabling businesses to operationalize their data by making it available in the tools where teams work, like CRMs, marketing platforms, and more.
Visit the following resources to learn more:
- [@official@Census](https://www.getcensus.com/reverse-etl)
- [@official@Census Documentation](https://developers.getcensus.com/getting-started/introduction)
- [@article@A starter guide to reverse ETL with Census](https://www.getcensus.com/blog/starter-guide-for-first-time-census-users)
- [@video@How to "Reverse ETL" with Census](https://www.youtube.com/watch?v=XkS7DQFHzbA)

View File

@@ -1 +1,16 @@
# Choosing the Right Technologies
# Choosing the Right Technologies
The data engineering ecosystem is rapidly expanding, and selecting the right technologies for your use case can be challenging. Below you can find some considerations for choosing data technologies across the data engineering lifecycle:
- **Team size and capabilities.** Your team's size will determine the amount of bandwidth your team can dedicate to complex solutions. For small teams, try to stick to simple solutions and technologies your team is familiar with.
- **Interoperability.** When choosing a technology or system, you'll need to ensure that it interacts and operates smoothly with other technologies.
- **Cost optimization and business value.** Consider direct and indirect costs of a technology and the opportunity cost of choosing some technologies over others.
- **Location.** Companies have many options when it comes to choosing where to run their technology stack, including cloud providers, on-premises systems, hybrid clouds, and multicloud.
- **Build versus buy.** Depending on your needs and capabilities, you can either invest in building your own technologies, implement open-source solutions, or purchase proprietary solutions and services.
- **Server versus serverless.** Depending on your needs, you may prefer server-based setups, where developers manage servers, or serverless systems, which shift server management to cloud providers, allowing developers to focus solely on writing code.
Visit the following resources to learn more:
- [@article@Build hybrid and multicloud architectures using Google Cloud](https://cloud.google.com/architecture/hybrid-multicloud-patterns)
- [@article@The Unfulfilled Promise of Serverless](https://www.lastweekinaws.com/blog/the-unfulfilled-promise-of-serverless/)
- [@book@Fundamentals of Data Engineering](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/)

View File

@@ -1 +1,11 @@
# CI/CD
# CI / CD
**Continuous Integration** is a software development method where team members integrate their work at least once daily. An automated build checks every integration to detect errors in this method. In Continuous Integration, the software is built and tested immediately after a code commit. In a large project with many developers, commits are made many times during the day. With each commit, code is built and tested.
**Continuous Delivery** is a software engineering method in which a team develops software products in a short cycle. It ensures that software can be easily released at any time. The main aim of continuous delivery is to build, test, and release software with good speed and frequency. It helps reduce the cost, time, and risk of delivering changes by allowing for frequent updates in production.
Visit the following resources to learn more:
- [@article@What is CI/CD? Continuous Integration and Continuous Delivery](https://www.guru99.com/continuous-integration.html)
- [@article@Continuous Integration vs Delivery vs Deployment](https://www.guru99.com/continuous-integration-vs-delivery-vs-deployment.html)
- [@article@CI/CD Pipeline: Learn with Example](https://www.guru99.com/ci-cd-pipeline.html)

View File

@@ -1 +1,10 @@
# Circle CI
# CircleCI
CircleCI is a CI/CD service that can be integrated with GitHub, Bitbucket, and GitLab repositories. The service can be used as a SaaS offering or self-managed using your own resources.
Visit the following resources to learn more:
- [@official@CircleCI](https://circleci.com/)
- [@official@CircleCI Documentation](https://circleci.com/docs)
- [@official@Configuration Tutorial](https://circleci.com/docs/config-intro)
- [@feed@Explore top posts about CI/CD](https://app.daily.dev/tags/cicd?ref=roadmapsh)

View File

@@ -1 +1,15 @@
# Cloud Architectures
# Cloud Architectures
Cloud architecture refers to how various cloud technology components, such as hardware, virtual resources, software capabilities, and virtual network systems, interact and connect to create cloud computing environments. Cloud architecture dictates how components are integrated so that you can pool, share, and scale resources over a network. It acts as a blueprint that defines the best way to strategically combine resources to build a cloud environment for a specific business need.
Cloud architecture components can include, among others:
- A frontend platform
- A backend platform
- A cloud-based delivery model
- A network (internet, intranet, or intercloud)
Visit the following resources to learn more:
- [@article@What is cloud architecture? - Google](https://cloud.google.com/learn/what-is-cloud-architecture)
- [@video@What is Cloud Architecture and Common Models?](https://www.youtube.com/watch?v=zTP-bx495hU)

View File

@@ -1 +1,9 @@
# Cloud Computing
# Cloud Computing
**Cloud Computing** refers to the delivery of computing services over the internet rather than using local servers or personal devices. These services include servers, storage, databases, networking, software, analytics, and intelligence. Cloud Computing enables faster innovation, flexible resources, and economies of scale. There are various types of cloud computing such as public clouds, private clouds, and hybrid clouds. Furthermore, it's divided into different services like Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). These services differ mainly in the level of control an organization has over their data and infrastructures.
Learn more from the following resources:
- [@article@Cloud Computing - IBM](https://www.ibm.com/think/topics/cloud-computing)
- [@article@What is Cloud Computing? - Azure](https://azure.microsoft.com/en-gb/resources/cloud-computing-dictionary/what-is-cloud-computing)
- [@video@What is Cloud Computing? - Amazon Web Services](https://www.youtube.com/watch?v=mxT233EdY5c)

View File

@@ -1 +1,9 @@
# Cloud SQL (Database)
# Cloud SQL (Database)
Google Cloud SQL is a fully-managed, cost-effective and scalable database service that makes it easy to set up, maintain, manage, and administer MySQL, PostgreSQL, and SQL Server databases in the cloud. Hosted on Google Cloud Platform, Cloud SQL provides a database infrastructure for applications running anywhere.
Visit the following resources to learn more:
- [@official@Cloud SQL](https://cloud.google.com/sql)
- [@official@Cloud SQL overview](https://cloud.google.com/sql/docs/introduction)
- [@course@Cloud SQL](https://www.cloudskillsboost.google/course_templates/701)

View File

@@ -1 +1,6 @@
# Cluster Computing Basics
# Cluster Computing Basics
Cluster computing is the process of using multiple computing nodes, called clusters, to increase processing power for solving complex problems, such as Big Data analytics and AI model training. These tasks require parallel processing of millions of data points for complex classification and prediction tasks. Cluster computing technology coordinates multiple computing nodes, each with its own CPUs, GPUs, and internal memory, to work together on the same data processing task. Applications on cluster computing infrastructure run as if on a single machine and are unaware of the underlying system complexities.

View File

@@ -1 +1,5 @@
# Cluster Management Tools
# Cluster Management Tools
Cluster management software maximizes the work that a cluster of computers can perform. A cluster manager balances workload to reduce bottlenecks, monitors the health of the elements of the cluster, and manages failover when an element fails. A cluster manager can also help a system administrator to perform administration tasks on elements in the cluster.
Some of the most popular Cluster Management Tools are Kubernetes and Apache Hadoop YARN.

View File

@@ -1 +1,9 @@
# Column
# Column
A columnar database is a type of NoSQL database that stores data by columns instead of by rows. In a traditional SQL database, all the information for one record is stored together, but in a columnar database, all the values for a single column are stored together. This makes it much faster to read and analyze large amounts of data, especially when you only need a few columns instead of the whole record. For example, if you want to quickly find the average sales price from millions of rows, a columnar database can scan just the "price" column instead of every piece of data. This design is often used in data warehouses and analytics systems because it speeds up queries and saves storage space through better compression.
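The effect is easy to demonstrate with Parquet, a popular columnar file format. A small pandas sketch (assumes the optional `pyarrow` engine is installed); only the `price` column is read back:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": range(5),
    "customer": ["a", "b", "c", "d", "e"],
    "price": [10.0, 20.0, 15.0, 30.0, 25.0],
})
df.to_parquet("orders.parquet")  # columnar storage on disk

# Read back a single column; the other columns are never scanned.
prices = pd.read_parquet("orders.parquet", columns=["price"])
print(prices["price"].mean())
```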
Visit the following resources to learn more:
- [@article@What are columnar databases? Here are 35 examples.](https://www.tinybird.co/blog-posts/what-is-a-columnar-database)
- [@article@Columnar Databases](https://www.techtarget.com/searchdatamanagement/definition/columnar-database)
- [@video@What is a Columnar Database? (vs. Row-oriented Database)](https://www.youtube.com/watch?v=1MnvuNg33pA)

View File

@@ -1 +1,11 @@
# Compute Engine (Compute)
# Compute Engine (Compute)
Compute Engine is a computing and hosting service that lets you create and run virtual machines on Google infrastructure. Compute Engine offers scale, performance, and value that lets you easily launch large compute clusters on Google's infrastructure. There are no upfront investments, and you can run thousands of virtual CPUs on a system that offers quick, consistent performance. You can configure and control Compute Engine resources using the Google Cloud console, the Google Cloud CLI, or using a REST-based API. You can also use a variety of programming languages to run Compute Engine, including Python, Go, and Java.
Visit the following resources to learn more:
- [@official@Compute Engine overview](https://cloud.google.com/compute/docs/overview)
- [@course@The Basics of Google Cloud Compute](https://www.cloudskillsboost.google/course_templates/754)
- [@video@Compute Engine in a minute](https://www.youtube.com/watch?v=IuK4gQeHRcI)

View File

@@ -1 +1,14 @@
# Containers & Orchestration
# Containers & Orchestration
**Containers** are lightweight, portable, and isolated environments that package applications and their dependencies, enabling consistent deployment across different computing environments. They encapsulate software code, runtime, system tools, libraries, and settings, ensuring that the application runs the same regardless of where it's deployed. Containers share the host operating system's kernel, making them more efficient than traditional virtual machines.
**Orchestration** refers to the automated coordination and management of complex IT systems. It involves combining multiple automated tasks and processes into a single workflow to achieve a specific goal. Orchestration is a key component of any software development process and is generally preferred over manual configuration. As an automation practice, orchestration helps remove the chance of human error from the different steps of the data engineering lifecycle, ensuring efficient resource utilization and consistency.
Visit the following resources to learn more:
- [@article@What are Containers?](https://cloud.google.com/learn/what-are-containers)
- [@article@Containers - The New Stack](https://thenewstack.io/category/containers/)
- [@article@An Introduction to Data Orchestration: Process and Benefits](https://www.datacamp.com/blog/introduction-to-data-orchestration-process-and-benefits)
- [@article@What is Container Orchestration?](https://www.redhat.com/en/topics/containers/what-is-container-orchestration)
- [@video@What are Containers?](https://www.youtube.com/playlist?list=PLawsLZMfND4nz-WDBZIj8-nbzGFD4S9oz)
- [@video@Why You Need Data Orchestration](https://www.youtube.com/watch?v=ZtlS5-G-gng)

View File

@@ -1 +1,11 @@
# CosmosDB
# CosmosDB
Azure Cosmos DB is a native NoSQL database service and vector database for working with the document data model. It can arbitrarily store native JSON documents with flexible schema. Data is indexed automatically and is available for query using a flavor of the SQL query language designed for JSON data. It also supports vector search. You can access the API using SDKs for popular frameworks such as .NET, Python, Java, and Node.js.
Visit the following resources to learn more:
- [@official@Azure Cosmos DB](https://azure.microsoft.com/en-us/products/cosmos-db#FAQ)
- [@official@Azure Cosmos DB - Database for the AI Era](https://learn.microsoft.com/en-us/azure/cosmos-db/introduction)
- [@article@Azure Cosmos DB: A Global-Scale NoSQL Cloud Database](https://www.datacamp.com/tutorial/azure-cosmos-db)
- [@video@What is Azure Cosmos DB?](https://www.youtube.com/watch?v=hBY2YcaIOQM)

View File

@@ -1 +1,9 @@
# CouchDB
# CouchDB
Apache CouchDB is an open source NoSQL document database that collects and stores data in JSON-based document formats. Unlike relational databases, CouchDB uses a schema-free data model, which simplifies record management across various computing devices, mobile phones and web browsers. In CouchDB, each document is uniquely named in the database, and CouchDB provides a RESTful HTTP API for reading and updating (add, edit, delete) database documents. Documents are the primary unit of data in CouchDB and consist of any number of fields and attachments.
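Because the interface is plain HTTP, a generic client is enough to create and read documents. A sketch with `requests`, assuming a local CouchDB with placeholder credentials:

```python
import requests

base = "http://admin:password@localhost:5984"  # placeholder host and credentials

# Create a database, then create and fetch a JSON document.
requests.put(f"{base}/customers")

resp = requests.put(f"{base}/customers/alice", json={"name": "Alice", "city": "Madrid"})
print(resp.json())  # e.g. {"ok": true, "id": "alice", "rev": "1-..."}

fetched = requests.get(f"{base}/customers/alice").json()
print(fetched["name"], fetched["city"])
```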
Visit the following resources to learn more:
- [@official@CouchDB](https://couchdb.apache.org/)
- [@official@CouchDB Documentation](https://docs.couchdb.org/en/stable/intro/overview.html)
- [@article@What is CouchDB?](https://www.ibm.com/think/topics/couchdb)

View File

@@ -1 +1,17 @@
# Data Analytics
# Data Analytics
Data Analytics involves extracting meaningful insights from raw data to drive decision-making processes. It includes a wide range of techniques and disciplines, ranging from simple data compilation to advanced algorithms and statistical analysis. Data analysts, as practitioners of this domain, employ these techniques to answer various questions:
- Descriptive Analytics *(what happened in the past?)*
- Diagnostic Analytics *(why did it happen in the past?)*
- Predictive Analytics *(what will happen in the future?)*
- Prescriptive Analytics *(how can we make it happen?)*
Visit the following resources to learn more:
- [@article@The 4 Types of Data Analysis: Ultimate Guide](https://careerfoundry.com/en/blog/data-analytics/different-types-of-data-analysis/)
- [@article@What is Data Analysis? An Expert Guide With Examples](https://www.datacamp.com/blog/what-is-data-analysis-expert-guide)
- [@course@Introduction to Data Analytics](https://www.coursera.org/learn/introduction-to-data-analytics)
- [@video@Descriptive vs Diagnostic vs Predictive vs Prescriptive Analytics: What's the Difference?](https://www.youtube.com/watch?v=QoEpC7jUb9k)
- [@video@Types of Data Analytics](https://www.youtube.com/watch?v=lsZnSgxMwBA)

View File

@@ -1 +1,14 @@
# Data Collection Considerations
# Data Collection Considerations
Before designing the technology architecture to collect and store data, you should consider the following factors:
- **Bounded versus unbounded**. Bounded data has defined start and end points, forming a finite, complete dataset, like the daily sales report. Unbounded data has no predefined limits in time or scope, flowing continuously and potentially indefinitely, such as user interaction events or real-time sensor data. The distinction is critical in data processing, where bounded data is suitable for batch processing, and unbounded data is processed in stream processing or real-time systems.
- **Frequency.** Collection processes can be batch, micro-batch, or real-time, depending on the frequency you need to store the data.
- **Synchronous versus asynchronous.** Synchronous ingestion is a process where the system waits for a response from the data source before proceeding. In contrast, asynchronous ingestion is a process where data is ingested without waiting for a response from the data source. Each approach has its benefits and drawbacks, and the choice depends on the specific requirements of the data ingestion process and the business needs.
- **Throughput and scalability.** As data demands grow, you will need scalable ingestion solutions to keep pace. Scalable data ingestion pipelines ensure that systems can handle increasing data volumes without compromising performance. Without scalable ingestion, data pipelines face challenges like bottlenecks and data loss. Bottlenecks occur when components can't process data fast enough, leading to delays and reduced throughput. Data loss happens when systems are overwhelmed, causing valuable information to be discarded or corrupted.
- **Reliability and durability.** Data reliability in the ingestion phase means ensuring that the acquired data from various sources is accurate, consistent, and trustworthy as it enters the data pipeline. Durability entails making sure that data isn't lost or corrupted during the data collection process.
Visit the following resources to learn more:
- [@book@Fundamentals of Data Engineering](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/)

View File

@@ -1 +1,16 @@
# Data Engineering Lifecycle
# Data Engineering Lifecycle
The data engineering lifecycle encompasses the entire process of transforming raw data into a useful end product. It involves several stages, each with specific roles and responsibilities. This lifecycle ensures that data is handled efficiently and effectively, from its initial generation to its final consumption.
It involves 4 steps:
1. Data Generation: Collecting data from various source systems.
2. Data Storage: Safely storing data for future processing and analysis.
3. Data Ingestion: Transforming and bringing data into a centralized system.
4. Data Serving: Providing data to end-users for decision-making and operational purposes.
Visit the following resources to learn more:
- [@article@Data Engineering Lifecycle](https://medium.com/towards-data-engineering/data-engineering-lifecycle-d1e7ee81632e)
- [@video@Getting Into Data Engineering](https://www.youtube.com/watch?v=hZu_87l62J4)
- [@book@Fundamentals of Data Engineering](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/)

View File

@@ -1 +1,16 @@
# Data Engineering Lifecycle
# Data Engineering Lifecycle
The data engineering lifecycle encompasses the entire process of transforming raw data into a useful end product. It involves several stages, each with specific roles and responsibilities. This lifecycle ensures that data is handled efficiently and effectively, from its initial generation to its final consumption.
It involves 4 steps:
1. Data Generation: Collecting data from various source systems.
2. Data Storage: Safely storing data for future processing and analysis.
3. Data Ingestion: Transforming and bringing data into a centralized system.
4. Data Serving: Providing data to end-users for decision-making and operational purposes.
Visit the following resources to learn more:
- [@article@Data Engineering Lifecycle](https://medium.com/towards-data-engineering/data-engineering-lifecycle-d1e7ee81632e)
- [@video@Getting Into Data Engineering](https://www.youtube.com/watch?v=hZu_87l62J4)
- [@book@Fundamentals of Data Engineering](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/)

View File

@@ -1 +1,8 @@
# Data Engineering vs Data Science
# Data Engineering vs Data Science
Data engineering and data science are distinct but complementary roles within the field of data. Data engineering focuses on building and maintaining the infrastructure for data collection, storage, and processing, essentially creating the systems that make data available for downstream users. On the other hand, data science professionals, like data analysts and data scientists, use that data to extract insights, build predictive models, and ultimately inform decision-making.
Visit the following resources to learn more:
- [@article@Data Scientist vs Data Engineer](https://www.datacamp.com/blog/data-scientist-vs-data-engineer)
- [@video@Should You Be a Data Scientist, Analyst or Engineer?](https://www.youtube.com/watch?v=dUnKYhripIE)

View File

@@ -1 +1,9 @@
# Data Fabric
# Data Fabric
A data fabric is a single environment consisting of a unified architecture, with services and technologies running on that architecture, that helps a company manage its data. It enables accessing, ingesting, integrating, and sharing data in an environment where the data can be batched or streamed and can live in the cloud or on-prem. The ultimate goal of a data fabric is to use all your data to gain better insights into your company and make better business decisions. A data fabric includes building blocks such as data pipelines, data access, data lakes, data stores, data policies, ingestion frameworks, and data visualization. These building blocks can be used to build platforms or “products” such as a client data integration platform, a data hub, a governance framework, and a global semantic layer, giving you centralized governance and standardization.
Visit the following resources to learn more:
- [@article@What is a data fabric?](http://ibm.com/think/topics/data-fabric)
- [@article@Data Fabric defined](https://www.jamesserra.com/archive/2021/06/data-fabric-defined/)
- [@article@How Data Fabric Can Optimize Data Delivery](https://www.gartner.com/en/data-analytics/topics/data-fabric)

View File

@@ -1 +1,10 @@
# Data Factory (ETL)
# Data Factory (ETL)
Data Factory, most commonly referring to Microsoft's Azure Data Factory, is a cloud-based data integration service that allows you to create, schedule, and orchestrate workflows to move and transform data from various sources into a centralized location for analysis. It provides tools for building Extract, Transform, and Load (ETL) pipelines, enabling businesses to prepare data for analytics, business intelligence, and other data-driven initiatives without extensive coding, thanks to its visual, code-free interface and native connectors.
Learn more from the following resources:
- [@official@What is Azure Data Factory?](https://learn.microsoft.com/en-us/azure/data-factory/introduction)
- [@official@Azure Data Factory Documentation](https://learn.microsoft.com/en-gb/azure/data-factory/)
- [@course@Microsoft Azure - Data Factory](https://www.coursera.org/learn/microsoft-azure---data-factory)

View File

@@ -1 +1,12 @@
# Data Generation
# Data Generation
Data generation refers to the different ways data is produced. Thanks to progress in computing power and storage, as well as technological breakthroughs in sensor technology (for example, IoT devices), the number of these so-called source systems is rapidly growing. Data is created in many ways, both analog and digital.
**Analog data** refers to continuous, real-world information that is represented by a range of values. It can take on any value within a given range and is often used to describe physical quantities like temperature or sound.
By contrast, **digital data** is either created by converting analog data to digital form (e.g., images or videos) or is the native product of a digital system, such as logs from a mobile app or synthetic data.
Visit the following resources to learn more:
- [@article@The Concept of Data Generation](https://www.marktechpost.com/2023/02/27/the-concept-of-data-generation/)
- [@video@Analog vs. Digital](https://www.youtube.com/watch?v=zzvglgC5ut0)

View File

@@ -1 +1,10 @@
# Data Hub
# Data Hub
A **data hub** is an architecture that provides a central point for the flow of data between multiple sources and applications, enabling organizations to collect, integrate, and manage data efficiently. Unlike traditional data storage solutions, a data hub's purpose focuses on data integration and accessibility. The design supports real-time data exchange, which makes accessing, analyzing, and acting on the data faster and easier.
A data hub differs from a data warehouse in that it is generally unintegrated and often at different grains. It differs from an operational data store because a data hub does not need to be limited to operational data. A data hub differs from a data lake by homogenizing data and possibly serving data in multiple desired formats, rather than simply storing it in one place, and by adding other value to the data such as de-duplication, quality, security, and a standardized set of query services.
Visit the following resources to learn more:
- [@article@Data hub](https://en.wikipedia.org/wiki/Data_hub)
- [@article@What is a Data Hub? Definition, 7 Key Benefits & Why You Might Need One](https://www.cdata.com/blog/what-is-a-data-hub)

View File

@@ -1 +1,8 @@
# Data Ingestion
# Data Ingestion
Data ingestion is the third step in the data engineering lifecycle. It entails the process of collecting and importing data files from various sources into a database for storage, processing and analysis. The goal of data ingestion is to clean and store data in an accessible and consistent central repository to prepare it for use within the organization.
Visit the following resources to learn more:
- [@article@What is Data Ingestion?](https://www.ibm.com/think/topics/data-ingestion)
- [@article@Data Ingestion](https://www.qlik.com/us/data-ingestion)

View File

@@ -1 +1,8 @@
# Data Interoperability
# Data Interoperability
Data interoperability is the ability of diverse systems and applications to access, exchange, and cooperatively use data in a coordinated and meaningful way, even across organizational boundaries. It ensures that data can flow freely, maintaining its integrity and context, allowing for improved efficiency, collaboration, and decision-making by breaking down data silos. Achieving data interoperability often relies on data standards, metadata, and common data elements to define how data is collected, formatted, and interpreted.
Visit the following resources to learn more:
- [@article@Data Interoperability](https://www.sciencedirect.com/topics/computer-science/data-interoperability)
- [@article@What is Data Interoperability? Exploring the Process and Benefits](https://www.codelessplatforms.com/blog/what-is-data-interoperability/)

View File

@@ -1 +1,8 @@
# Data Lake
# Data lakes
**Data Lakes** are large-scale data repository systems that store raw, untransformed data, in various formats, from multiple sources. They're often used for big data and real-time analytics requirements. Data lakes preserve the original data format and schema which can be modified as necessary.
Learn more from the following resources:
- [@article@Data Lake Definition](https://azure.microsoft.com/en-gb/resources/cloud-computing-dictionary/what-is-a-data-lake)
- [@video@What is a Data Lake?](https://www.youtube.com/watch?v=LxcH6z8TFpI)

View File

@@ -1 +1,8 @@
# Data Lineage
# Data Lineage
**Data Lineage** refers to the lifecycle of data, including its origins, movements, characteristics, and quality. It's a critical component in data engineering for tracking the journey of data through every process in a pipeline, from raw input to model output. Data lineage helps in maintaining transparency, ensuring compliance, and facilitating data debugging or tracing data-related bugs. It provides a clear representation of data sources, transformations, and dependencies, thereby aiding in audits, governance, or reproduction of machine learning models.
Learn more from the following resources:
- [@article@What is Data Lineage? - IBM](https://www.ibm.com/topics/data-lineage)
- [@article@What is Data Lineage? - Datacamp](https://www.datacamp.com/blog/data-lineage)

View File

@@ -1 +1,12 @@
# Data Mart
# Data Mart
A data mart is a subset of a data warehouse, focused on a specific business function or department. A data mart is streamlined for quicker querying and a more straightforward setup, catering to the specialized needs of a particular team or function. Data marts only hold data relevant to a specific department or business unit, enabling quicker access to specific datasets and simpler management.
Visit the following resources to learn more:
- [@article@What is a Data Mart?](https://www.ibm.com/think/topics/data-mart)
- [@article@Data Mart vs Data Warehouse: a Detailed Comparison](https://www.datacamp.com/blog/data-mart-vs-data-warehouse)
- [@video@Data Lake VS Data Warehouse VS Data Marts](https://www.youtube.com/watch?v=w9-WoReNKHk)

View File

@@ -1 +1,9 @@
# Data Masking
# Data Masking
Data masking is a process that creates a copy of real data but replaces sensitive information with false but realistic-looking data, preserving the format and structure of the original data for non-production uses like software testing, training, and development. The goal is to protect confidential information and ensure compliance with data protection regulations by preventing unauthorized access to real sensitive data without compromising the usability of the data for other business functions.
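To make the idea concrete, here is a small Python sketch (field names and masking rules are illustrative) that replaces sensitive values with realistic-looking substitutes while preserving format:

```python
import hashlib

def mask_email(email: str) -> str:
    # Replace the local part with a deterministic token, keeping the format.
    local, _, domain = email.partition("@")
    token = hashlib.sha256(local.encode()).hexdigest()[:8]
    return f"user_{token}@{domain}"

def mask_card(card_number: str) -> str:
    # Keep only the last four digits, preserving the original length.
    return "*" * (len(card_number) - 4) + card_number[-4:]

print(mask_email("jane.doe@example.com"))  # e.g. user_1a2b3c4d@example.com
print(mask_card("4111111111111111"))       # ************1111
```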
Visit the following resources to learn more:
- [@article@Data masking](https://en.wikipedia.org/wiki/Data_masking)
- [@article@What is data masking?](https://aws.amazon.com/what-is/data-masking/)

View File

@@ -1 +1,9 @@
# Data Mesh
# Data Mesh
A data mesh is a modern approach to data architecture that shifts data management from a centralized model to a decentralized one. It emphasizes domain-oriented ownership, where data management aligns with specific business areas. This alignment makes data operations more scalable and flexible, leveraging the knowledge and expertise of those closest to the data. Data mesh is defined by four principles: data domains, data products, self-serve data platform, and federated computational governance.
Visit the following resources to learn more:
- [@article@What Is a Data Mesh? - AWS](https://aws.amazon.com/what-is/data-mesh)
- [@article@What Is a Data Mesh? - Datacamp](https://www.datacamp.com/blog/data-mesh)
- [@article@Data Mesh Architecture](https://www.datamesh-architecture.com/)

View File

@@ -1 +1,13 @@
# Data Modelling Techniques
# Data Modelling Techniques
A data model is a specification of data structures and business rules. It creates a visual representation of data and illustrates how different data elements are related to each other. Different techniques are employed depending on the complexity of the data and the goals. Below you can find a list with the most common data modelling techniques:
- **Entity-relationship modeling.** It's one of the most common techniques used to represent data. It's based on three elements: Entities (objects or things within the system), relationships (how these entities interact with each other), and attributes (properties of the entities).
- **Dimensional modeling.** Dimensional modeling is widely used in data warehousing and analytics, where data is often represented in terms of facts and dimensions. This technique simplifies complex data by organizing it into a star or snowflake schema (see the sketch after this list).
- **Object-oriented modeling.** Object-oriented modeling is used to represent complex systems, where data and the functions that operate on it are encapsulated as objects. This technique is preferred for modeling applications with complex, interrelated data and behaviors.
- **NoSQL modeling.** NoSQL modeling techniques are designed for flexible, schema-less databases. These approaches are often used when data structures are less rigid or evolve over time.
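As a rough illustration of dimensional modeling, the sketch below creates a minimal star schema in SQLite; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the who/what/when of each event.
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);

-- The fact table holds measurable events, keyed to the dimensions.
CREATE TABLE fact_sales (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    date_id     INTEGER REFERENCES dim_date(date_id),
    quantity    INTEGER,
    amount      REAL
);
""")
```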
Visit the following resources to learn more:
- [@article@7 data modeling techniques and concepts for business](https://www.techtarget.com/searchdatamanagement/tip/7-data-modeling-techniques-and-concepts-for-business)
- [@article@Data Modeling Explained: Techniques, Examples, and Best Practices](https://www.datacamp.com/blog/data-modeling)

View File

@@ -1 +1,9 @@
# Data Normalization
# Database Normalization
Database normalization is the process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by Edgar F. Codd as part of his relational model. Normalization entails organizing the columns (attributes) and tables (relations) of a database to ensure that their dependencies are properly enforced by database integrity constraints. It is accomplished by applying some formal rules either by a process of synthesis (creating a new database design) or decomposition (improving an existing database design).
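As a brief sketch (schema invented for illustration), the decomposition below removes the redundancy of repeating customer details on every order row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Before: orders_flat(order_id, customer_name, customer_email, item, price)
-- repeats the customer's details on every order row.

-- After normalization, each customer fact is stored exactly once:
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    item  TEXT NOT NULL,
    price REAL NOT NULL
);
""")
```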
Visit the following resources to learn more:
- [@article@What is Normalization in DBMS (SQL)? 1NF, 2NF, 3NF, BCNF Database with Example](https://www.guru99.com/database-normalization.html)
- [@video@Complete guide to Database Normalization in SQL](https://www.youtube.com/watch?v=rBPQ5fg_kiY)
- [@feed@Explore top posts about Database](https://app.daily.dev/tags/database?ref=roadmapsh)

View File

@@ -1 +1,4 @@
# Data Obfuscation
# Data Obfuscation
Statistical data obfuscation involves altering the values of sensitive data in a way that preserves the statistical properties and relationships within the data. It ensures that the masked data maintains the overall distribution, patterns, and correlations of the original data for accurate statistical analysis. Statistical data obfuscation techniques include applying mathematical functions or perturbation algorithms to the data.
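A minimal sketch of the perturbation approach, using fabricated salary figures: each value is shifted by small zero-mean noise, so individual records change while aggregate statistics stay close to the originals.

```python
import random
import statistics

salaries = [52_000, 61_500, 58_200, 75_000, 49_900]  # fabricated example data

# Zero-mean Gaussian noise perturbs each record but roughly preserves the mean.
obfuscated = [round(s + random.gauss(0, 500)) for s in salaries]

print(statistics.mean(salaries), statistics.mean(obfuscated))
```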

View File

@@ -1 +1,8 @@
# Data Pipelines
# Data Pipelines
Data pipelines are a series of automated processes that transport and transform data from various sources to a destination for analysis or storage. They typically involve steps like data extraction, cleaning, transformation, and loading (ETL) into databases, data lakes, or warehouses. Pipelines can handle batch or real-time data, ensuring that large-scale datasets are processed efficiently and consistently. They play a crucial role in ensuring data integrity and enabling businesses to derive insights from raw data for reporting, analytics, or machine learning.
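A toy sketch of the pattern (the source and sink are stand-ins) showing the extract, transform, and load stages as plain functions:

```python
def extract() -> list[dict]:
    # Stand-in for reading from an API, file, or database.
    return [{"user": "ada", "amount": "19.99"}, {"user": "lin", "amount": "5.00"}]

def transform(rows: list[dict]) -> list[dict]:
    # Clean and type-cast each record.
    return [{"user": r["user"].title(), "amount": float(r["amount"])} for r in rows]

def load(rows: list[dict], sink: list) -> None:
    # Stand-in for writing to a warehouse or data lake.
    sink.extend(rows)

warehouse: list = []
load(transform(extract()), warehouse)
print(warehouse)
```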
Learn more from the following resources:
- [@article@What is a Data Pipeline? - IBM](https://www.ibm.com/topics/data-pipeline)
- [@video@What are Data Pipelines?](https://www.youtube.com/watch?v=oKixNpz6jNo)

View File

@@ -1 +1,5 @@
# Data Quality
# Data Quality
Ensuring quality involves validating the accuracy, completeness, consistency, and reliability of the data collected from each source. Whether the data comes from one source or many changes little; with multiple sources, the main extra task is homogenizing the final schema while ensuring deduplication and normalization.
This typically includes verifying the credibility of each data source, standardizing formats (like date/time or currency), performing schema alignment, and running profiling to detect anomalies, duplicates, or mismatches before integrating the data for analysis.
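As a small sketch of such checks (the record layout is invented for illustration), each rule maps to one quality dimension:

```python
from datetime import datetime

def check_row(row: dict) -> list[str]:
    issues = []
    if not row.get("id"):
        issues.append("missing id")        # completeness
    if "@" not in row.get("email", ""):
        issues.append("invalid email")     # accuracy
    try:
        datetime.strptime(row.get("date", ""), "%Y-%m-%d")
    except ValueError:
        issues.append("bad date format")   # consistency
    return issues

rows = [{"id": 1, "email": "a@b.com", "date": "2024-01-31"},
        {"id": None, "email": "oops", "date": "31/01/2024"}]
for r in rows:
    print(r["id"], check_row(r))
```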

View File

@@ -1 +1,7 @@
# Data Quality
# Data Quality
Data quality refers to the degree to which a dataset is accurate, complete, consistent, relevant, and timely, making it fit for its intended use. High-quality data is reliable and trustworthy, enabling better decision-making, accurate analysis, and effective strategies, while poor data quality can lead to flawed insights, wasted resources, and negative consequences for an organization.
Visit the following resources to learn more:
- [@article@What is Data Quality?](https://www.ibm.com/think/topics/data-quality)

View File

@@ -1 +1,4 @@
# Data Serving
# Data Serving
Data serving is the last step in the data engineering process. Once the data is stored in your data architecture and transformed into a coherent and useful format, it's time to get value from it. Data serving refers to the different ways data is used by downstream applications and users to create value. There are many ways companies can extract value from data, including training machine learning models, BI analytics, and reverse ETL.

View File

@@ -1 +1,7 @@
# Data Storage
# Data Storage
Data storage is the process of saving and preserving digital information on various physical or cloud-based media for future retrieval and use. It encompasses the use of technologies and devices like hard drives and cloud platforms to store data.
Visit the following resources to learn more:
- [@article@What is data storage?](https://www.ibm.com/think/topics/data-storage)

View File

@@ -1 +1,13 @@
# Data Structures and Algorithms
# Data Structures and Algorithms
**Data Structures** are primarily used to collect, organize, and perform operations on stored data more effectively. They are essential for designing advanced, data-intensive applications. Examples include Array, Linked List, Stack, Queue, Hash Map, and Tree.
**Algorithms** are a sequence of instructions or rules for performing a particular task. Algorithms can be used for data searching, sorting, or performing complex business logic. Some commonly used algorithms are Binary Search, Bubble Sort, and Selection Sort. A deep understanding of data structures and algorithms is crucial for optimizing the performance and memory consumption of data pipelines.
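For instance, here is binary search, one of the algorithms mentioned above, as a short Python sketch:

```python
def binary_search(items: list[int], target: int) -> int:
    # Return the index of target in a sorted list, or -1 if absent. O(log n).
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(binary_search([2, 5, 8, 13, 21, 34], 13))  # 3
```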
Visit the following resources to learn more:
- [@video@Data Structures Illustrated](https://www.youtube.com/watch?v=9rhT3P1MDHk&list=PLkZYeFmDuaN2-KUIv-mvbjfKszIGJ4FaY)
- [@article@Interview Questions about Data Structures](https://www.csharpstar.com/csharp-algorithms/)
- [@video@Intro to Algorithms](https://www.youtube.com/watch?v=rL8X2mlNHPM)
- [@feed@Explore top posts about Algorithms](https://app.daily.dev/tags/algorithms?ref=roadmapsh)

View File

@@ -1 +1,8 @@
# Data Warehouse
# Data Warehouse
**Data Warehouses** are data storage systems which are designed for analyzing, reporting, and integrating with transactional systems. The data in a warehouse is clean, consistent, and often transformed to meet a wide range of business requirements. Hence, data warehouses provide structured data but require more processing and management compared to data lakes.
Learn more from the following resources:
- [@article@What Is a Data Warehouse?](https://www.oracle.com/database/what-is-a-data-warehouse/)
- [@video@What is a Data Warehouse?](https://www.youtube.com/watch?v=k4tK2ttdSDg)

View File

@@ -1 +1,3 @@
# Data Warehousing Architectures
# Data Warehousing Architectures
Data Warehousing Architectures refers to the different systems and solutions for storing data. Options include traditional data warehouses, data marts, data lakes, and data mesh architectures.

View File

@@ -1 +1,17 @@
# Database Fundamentals
# Database Fundamentals
A database is a collection of useful data from one or more related organizations, structured in a way that makes the data an asset to the organization. A database management system is software designed to assist in maintaining and extracting large collections of data in a timely fashion.
A **Relational database** is a type of database that stores and provides access to data points that are related to one another. Relational databases store data in a series of tables.
**NoSQL databases** offer data storage and retrieval that is modelled differently to "traditional" relational databases. NoSQL databases typically focus more on horizontal scaling, eventual consistency, speed, and flexibility, and are commonly used for big data and real-time streaming applications.
Visit the following resources to learn more:
- [@article@Oracle: What is a Database?](https://www.oracle.com/database/what-is-database/)
- [@article@Prisma.io: What are Databases?](https://www.prisma.io/dataguide/intro/what-are-databases)
- [@article@Intro To Relational Databases](https://www.udacity.com/course/intro-to-relational-databases--ud197)
- [@video@What is Relational Database](https://youtu.be/OqjJjpjDRLc)
- [@article@NoSQL Explained](https://www.mongodb.com/nosql-explained)
- [@video@How do NoSQL Databases work](https://www.youtube.com/watch?v=0buKQHokLK8)
- [@feed@Explore top posts about Database](https://app.daily.dev/tags/database?ref=roadmapsh)

View File

@@ -1 +1,3 @@
# Database
# Database
A database is an organized, structured collection of electronic data that is stored, managed, and accessed via a computer system, usually controlled by a Database Management System (DBMS). Databases organize various types of data, such as words, numbers, images, and videos, allowing users to easily retrieve, update, and modify it for various purposes, from managing customer information to analyzing business processes.

View File

@@ -1 +1,11 @@
# Databricks Delta Lake
# Databricks Delta Lake
Delta Lake is the optimized storage layer that provides the foundation for tables in a lakehouse on Databricks. Delta Lake is open source software that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling. Delta Lake is fully compatible with Apache Spark APIs, and was developed for tight integration with Structured Streaming, allowing you to easily use a single copy of data for both batch and streaming operations and providing incremental processing at scale.
Visit the following resources to learn more:
- [@official@What is Delta Lake in Databricks?](https://docs.databricks.com/aws/en/delta)
- [@article@Delta Table in Databricks: A Complete Guide](https://www.datacamp.com/tutorial/delta-table-in-databricks)
- [@video@Delta Lake](https://www.databricks.com/resources/demos/videos/lakehouse-platform/delta-lake)
- [@book@The Delta Lake Series — Fundamentals and Performance](https://www.databricks.com/resources/ebook/the-delta-lake-series-fundamentals-performance)

View File

@@ -1 +1,8 @@
# Datadog
# Datadog
Datadog is a monitoring and analytics platform for large-scale applications. It encompasses infrastructure monitoring, application performance monitoring, log management, and user-experience monitoring. Datadog aggregates data across your entire stack with 400+ integrations for troubleshooting, alerting, and graphing.
Visit the following resources to learn more:
- [@official@Datadog](https://www.datadoghq.com/)
- [@official@Datadog Documentation](https://docs.datadoghq.com/)

View File

@@ -1 +1,9 @@
# Dataflow
# Dataflow
Dataflow is a Google Cloud service that provides unified stream and batch data processing at scale. Typical use cases for Dataflow include data movement, ETL processes, BI dashboarding, and applying ML in real time to streaming data.
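Dataflow pipelines are written with Apache Beam. Here is a minimal sketch, assuming the Beam Python SDK is installed (`pip install apache-beam`); run as-is it uses the local DirectRunner, and the same pipeline can be submitted to Dataflow with the DataflowRunner.

```python
import apache_beam as beam

# A unified batch pipeline; swapping the runner moves it to Dataflow unchanged.
with beam.Pipeline() as pipeline:
    (pipeline
     | "Create"  >> beam.Create(["alpha", "beta", "gamma"])
     | "Lengths" >> beam.Map(lambda word: (word, len(word)))
     | "Print"   >> beam.Map(print))
```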
Visit the following resources to learn more:
- [@official@Dataflow](https://cloud.google.com/products/dataflow)
- [@article@Dataflow](https://en.wikipedia.org/wiki/Google_Cloud_Dataflow)
- [@video@What is Google Dataflow](https://www.youtube.com/watch?v=KalJ0VuEM7s)

View File

@@ -1 +1,9 @@
# dbt
# dbt
dbt, also known as the data build tool, is designed to simplify the management of data warehouses and transform the data within. This is primarily the T, or transformation, within ELT (or sometimes ETL) processes. It allows for easy transition between data warehouse types, such as Snowflake, BigQuery, Postgres, or DuckDB. dbt also provides the ability to use SQL across teams of multiple users, simplifying interaction. In addition, dbt translates between SQL dialects as appropriate to connect to different data sources and warehouses.
Visit the following resources to learn more:
- [@official@dbt](https://www.getdbt.com/product/what-is-dbt)
- [@official@dbt Documentation](https://docs.getdbt.com/docs/build/documentation)
- [@course@dbt Official Courses](https://learn.getdbt.com/catalog)

View File

@@ -1 +1,15 @@
# Declarative vs Imperative
# Declarative vs Imperative
When it comes to Infrastructure as Code (IaC), there are two fundamental styles: imperative and declarative.
In **imperative IaC**, you specify a list of steps the IaC tool should follow to provision a new resource. You tell your IaC tool how to create each environment using a sequence of command imperatives. Imperative IaC can offer more flexibility as it allows you to dictate each step. However, this can result in increased complexity. Popular imperative IaC tools are Chef and Puppet.
In **declarative IaC**, you specify the name and properties of the infrastructure resources you wish to provision, and then the IaC tool figures out how to achieve that end result on its own. You declare to your IaC tool what you want, but not how to get there. Declarative IaC, while less flexible, tends to be simpler and more manageable. Terraform is the most popular declarative IaC tool.
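The contrast can be sketched in Python; every call below is a hypothetical stand-in for an IaC tool's API, not a real library.

```python
# Imperative: you spell out each provisioning step yourself.
def create_server_imperative(cloud):
    vm = cloud.create_vm(size="small")      # hypothetical call
    cloud.attach_disk(vm, size_gb=50)       # hypothetical call
    cloud.open_port(vm, 443)                # hypothetical call
    return vm

# Declarative: you describe the desired end state and let the tool
# compute and apply whatever steps are needed to reach it.
desired_state = {"vm": {"size": "small", "disk_gb": 50, "open_ports": [443]}}

def reconcile(cloud, desired):
    actual = cloud.describe()               # hypothetical call
    for step in cloud.plan(actual, desired):
        step.apply()
```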
Visit the following resources to learn more:
- [@article@Infrastructure as Code: From Imperative to Declarative and Back Again](https://thenewstack.io/infrastructure-as-code-from-imperative-to-declarative-and-back-again/)
- [@article@Declarative vs Imperative Programming for Infrastructure as Code (IaC)](https://www.copado.com/resources/blog/declarative-vs-imperative-programming-for-infrastructure-as-code-iac)

View File

@@ -1 +1,7 @@
# Distributed File Systems
# Distributed File Systems
A Distributed File System (DFS) allows multiple computers to access and share files across a network as if they were stored on a single local machine. It distributes data across multiple servers, enhancing accessibility and data redundancy. This enables users to access files from various locations and devices, promoting collaboration and data availability.
Visit the following resources to learn more:
- [@article@What is a Distributed File System (DFS)? A Complete Guide](http://starwindsoftware.com/blog/what-is-a-distributed-file-system-dfs-a-complete-guide/)

View File

@@ -1 +1,11 @@
# Docker
# Docker
Docker is an open-source platform that automates the deployment, scaling, and management of applications using containerization technology. It enables developers to package applications with all their dependencies into standardized units called containers, ensuring consistent behavior across different environments. Docker provides a lightweight alternative to full machine virtualization, using OS-level virtualization to run multiple isolated systems on a single host. Its ecosystem includes tools for building, sharing, and running containers, such as Docker Engine, Docker Hub, and Docker Compose. Docker has become integral to modern DevOps practices, facilitating microservices architectures, continuous integration/deployment pipelines, and efficient resource utilization in both development and production environments.
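A quick sketch with the Docker SDK for Python, assuming Docker is running locally and the SDK is installed (`pip install docker`):

```python
import docker

client = docker.from_env()  # connect to the local Docker daemon

# Run a throwaway container and capture its output.
output = client.containers.run("alpine", ["echo", "hello from a container"],
                               remove=True)
print(output.decode().strip())
```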
Visit the following resources to learn more:
- [@roadmap@Visit Dedicated Docker Roadmap](https://roadmap.sh/docker)
- [@official@Docker Documentation](https://docs.docker.com/)
- [@video@Docker Tutorial](https://www.youtube.com/watch?v=RqTEHSBrYFw)
- [@video@Docker simplified in 55 seconds](https://youtu.be/vP_4DlOH1G4)
- [@feed@Explore top posts about Docker](https://app.daily.dev/tags/docker?ref=roadmapsh)

View File

@@ -1 +1,8 @@
# Document
# Document
**Document databases** are a type of NoSQL database that store data in JSON, BSON, or XML formats, allowing for flexible, semi-structured, and hierarchical data structures. These databases are characterized by their dynamic schema, scalability through distribution, and ability to intuitively map data models to application code. Popular examples include MongoDB, which allows for easy storage and retrieval of varied data types without requiring a rigid, predefined schema.
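A short sketch with MongoDB's Python driver, assuming a local server and `pip install pymongo`; the database, collection, and fields are illustrative:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["shop"]["users"]

# Documents are flexible: records need not share the same fields.
users.insert_one({"name": "Ada", "email": "ada@example.com", "tags": ["admin"]})
users.insert_one({"name": "Lin", "signup": {"plan": "free", "year": 2024}})

print(users.find_one({"name": "Ada"}))
```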
Visit the following resources to learn more:
- [@article@What is a Document Database?](https://www.mongodb.com/resources/basics/databases/document-databases)
- [@article@Document-oriented database](https://en.wikipedia.org/wiki/Document-oriented_database)

View File

@@ -1 +1,7 @@
# DynamoDB
# DynamoDB
Amazon DynamoDB is a fully managed NoSQL database solution that provides fast and predictable performance with seamless scalability. It is a key-value and document database that delivers single-digit millisecond performance at any scale. DynamoDB can handle more than 10 trillion requests per day and support peaks of more than 20 million requests per second. It maintains high durability of data via automatic replication across three different zones in an Amazon-defined region.
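A minimal sketch using boto3, assuming configured AWS credentials and an existing table (the table name and key schema here are illustrative):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Users")  # assumes a table with partition key "user_id"

# Key-value access: write an item, then read it back by its primary key.
table.put_item(Item={"user_id": "u-123", "name": "Ada", "plan": "pro"})
item = table.get_item(Key={"user_id": "u-123"}).get("Item")
print(item)
```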
Visit the following resources to learn more:
- [@official@Amazon DynamoDB](https://aws.amazon.com/dynamodb/)

View File

@@ -1 +1,10 @@
# ECPA
# ECPA
The California Consumer Privacy Act (CCPA) is a California state law enacted in 2020 that protects and enforces the rights of Californians regarding the privacy of consumers' personal information (PI).
Visit the following resources to learn more:
- [@official@California Consumer Privacy Act (CCPA)](https://oag.ca.gov/privacy/ccpa)
- [@article@What is the California Consumer Privacy Act (CCPA)?](https://www.ibm.com/think/topics/ccpa-compliance)
- [@video@What is the California Consumer Privacy Act? | CCPA Explained?](https://www.youtube.com/watch?v=dpzsAgrDAO4)

View File

@@ -1 +1,10 @@
# ElasticSearch
# Elasticsearch
Elasticsearch is, at its core, a document-oriented search engine. It is a document-based database that lets you insert, delete, retrieve, and even perform analytics on the saved records. But Elasticsearch is unlike any other general-purpose database you have worked with in the past. It's essentially a search engine and offers an arsenal of features you can use to retrieve the data stored in it, as per your search criteria, at lightning speed.
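A small sketch with the official Python client, assuming a local cluster and `pip install elasticsearch`; the index name and document are illustrative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a document, then run a full-text query against it.
es.index(index="articles", id="1", document={"title": "Data engineering 101"})
es.indices.refresh(index="articles")  # make the document searchable right away

hits = es.search(index="articles", query={"match": {"title": "engineering"}})
print(hits["hits"]["total"])
```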
Visit the following resources to learn more:
- [@official@Elasticsearch Website](https://www.elastic.co/elasticsearch/)
- [@official@Elasticsearch Documentation](https://www.elastic.co/guide/index.html)
- [@video@What is Elasticsearch](https://www.youtube.com/watch?v=ZP0NmfyfsoM)
- [@feed@Explore top posts about ELK](https://app.daily.dev/tags/elk?ref=roadmapsh)

View File

@@ -1 +1,8 @@
# Encryption
# Encryption
Encryption is used to protect data from being stolen, changed, or compromised and works by scrambling data into a secret code that can only be unlocked with a unique digital key. Encrypted data can be protected while at rest on computers or in transit between them, or while being processed, regardless of whether those computers are located on-premises or are remote cloud servers.
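As a tiny illustration of symmetric encryption at rest, assuming the `cryptography` package is installed:

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()            # the unique digital key; keep it secret
f = Fernet(key)

token = f.encrypt(b"card=4111-1111")   # scrambled ciphertext, safe to store
print(f.decrypt(token))                # only the key holder can recover the data
```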
Visit the following resources to learn more:
- [@article@What is Encryption?](https://cloud.google.com/learn/what-is-encryption)
- [@video@What is Encryption?](https://www.youtube.com/watch?v=9chKCUQ8_VQ)

View File

@@ -1 +1,8 @@
# End-to-End Testing
# End-to-End Testing
End-to-end (E2E) testing is a form of testing used to assert your entire application works as expected from start to finish, or "end-to-end". E2E testing differs from unit testing in that it is completely decoupled from the underlying implementation details of your code. It is typically used to validate an application in a way that mimics the way a user would interact with it.
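As a rough sketch, an E2E test can drive a running application purely through its public interface; the base URL and endpoints below are hypothetical, and `requests` is assumed installed:

```python
import requests

BASE = "http://localhost:8000"  # hypothetical application under test

def test_signup_then_login():
    # Mimic a user's journey from start to finish, with no knowledge
    # of the application's internals.
    r = requests.post(f"{BASE}/signup", json={"user": "ada", "password": "pw"})
    assert r.status_code == 201
    r = requests.post(f"{BASE}/login", json={"user": "ada", "password": "pw"})
    assert r.status_code == 200
    assert "token" in r.json()
```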
Visit the following resources to learn more:
- [@article@End to End Testing](https://microsoft.github.io/code-with-engineering-playbook/automated-testing/e2e-testing/)
- [@article@End to End Testing: Importance, Process, Best Practices & Frameworks](https://testgrid.io/blog/end-to-end-testing-a-detailed-guide/)

View File

@@ -1 +1,7 @@
# Environmental Management
# Environmental Management
Environmental management, or Environment as Code (EaC) takes the concept of Infrastructure as Code (IaC) one step further. EaC applies DevOps principles to manage and automate entire software environments—including infrastructure, applications, and configurations—using code, making them reproducible, versionable, and reliable. It extends IaC by focusing not just on the underlying servers and networks but on the complete, connected system of services and applications that run on top of it. This approach helps increase efficiency, speeds up deployments, and provides a consistent, auditable process for creating and managing development, testing, and production environments.
Visit the following resources to learn more:
- [@article@What Is Environment as Code (EaaC)?](https://www.bunnyshell.com/blog/what-is-environment-as-code-eaac/)

View File

@@ -1 +1,11 @@
# ETL vs Reverse ETL
# ETL vs Reverse ETL
ETL (Extract, Transform, Load) is a key process in data warehousing, enabling the integration of data from multiple sources into a centralized database.
Reverse ETL emerged as organizations recognized that their carefully curated data warehouses, while excellent for analysis, created a new form of data silo that prevented operational teams from accessing valuable insights. This methodology addresses the critical gap between analytical insights and operational execution by systematically moving processed data from centralized repositories back to the operational systems where business teams interact with customers and manage daily operations.
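A minimal reverse ETL sketch: pull a curated segment out of the warehouse (SQLite stands in here) and push it to an operational tool. The table, columns, and CRM endpoint are hypothetical, and `requests` is assumed installed.

```python
import sqlite3
import requests

conn = sqlite3.connect("warehouse.db")
rows = conn.execute(
    "SELECT email, lifetime_value FROM customer_metrics WHERE lifetime_value > 1000"
).fetchall()

# Sync warehouse insights back into the system business teams actually use.
for email, ltv in rows:
    requests.post("https://crm.example.com/api/contacts",
                  json={"email": email, "lifetime_value": ltv})
```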
Visit the following resources to learn more:
- [@article@What is ETL?](https://www.snowflake.com/guides/what-etl)
- [@article@ETL vs Reverse ETL vs Data Activation](https://airbyte.com/data-engineering-resources/etl-vs-reverse-etl-vs-data-activation)
- [@article@ETL vs Reverse ETL: An Overview, Key Differences, & Use Cases](https://portable.io/learn/etl-vs-reverse-etl)

View File

@@ -1 +1,12 @@
# EU AI Act
# EU AI Act
The Artificial Intelligence Act of the European Union, also known as the EU AI Act, is a comprehensive regulatory framework established to ensure safety and that fundamental human rights are upheld in the use of AI technologies. It governs the development and/or use of AI in the European Union. The act takes a risk-based approach to regulation, applying different rules to AI systems according to the risk they pose.
Considered the world's first comprehensive regulatory framework for AI, the EU AI Act prohibits some AI uses outright and implements strict governance, risk management and transparency requirements for others.
Visit the following resources to learn more:
- [@official@The EU AI Act Explorer](https://artificialintelligenceact.eu/ai-act-explorer/)
- [@article@AI Act - European Commission](https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai)
- [@article@Artificial Intelligence Act](https://en.wikipedia.org/wiki/Artificial_Intelligence_Act)
- [@video@The EU AI Act Explained](https://www.youtube.com/watch?v=s_rxOnCt3HQ)

View File

@@ -1 +1,3 @@
# Extract Data
# Extract Data
The first step in ETL processes involves extracting data from data sources to a staging area. Data can come in various types and formats, from SQL or NoSQL databases and plain text to image and video files.

View File

@@ -1 +1,9 @@
# Functional Testing
# Functional Testing
Functional testing is a type of software testing that validates the software system against the functional requirements/specifications. The purpose of functional tests is to test each function of the software application by providing appropriate input and verifying the output against the functional requirements.
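A tiny sketch of the idea, runnable with pytest; the function and its specification are invented for illustration:

```python
# Function under test; the specification says tax is 10% of the subtotal.
def total_with_tax(subtotal: float) -> float:
    return round(subtotal * 1.10, 2)

# Functional tests: known inputs checked against expected outputs,
# with no knowledge of the implementation.
def test_typical_order():
    assert total_with_tax(100.0) == 110.0

def test_zero_order():
    assert total_with_tax(0.0) == 0.0
```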
Visit the following resources to learn more:
- [@article@What is Functional Testing? Types & Examples](https://www.guru99.com/functional-testing.html)
- [@article@Functional Testing : A Detailed Guide](https://www.browserstack.com/guide/functional-testing)
- [@feed@Explore top posts about Testing](https://app.daily.dev/tags/testing?ref=roadmapsh)

View File

@@ -1 +1,8 @@
# GDPR
# GDPR in API Design
The General Data Protection Regulation (GDPR) is an essential standard in API Design that addresses the storage, transfer, and processing of personal data of individuals within the European Union. With regards to API Design, considerations must be given on how APIs handle, process, and secure the data to conform with GDPR's demands on data privacy and security. This includes requirements for explicit consent, right to erasure, data portability, and privacy by design. Non-compliance with these standards not only leads to hefty fines but may also erode trust from users and clients. As such, understanding the impact and integration of GDPR within API design is pivotal for organizations handling EU residents' data.
Learn more from the following resources:
- [@official@GDPR](https://gdpr-info.eu/)
- [@article@What is GDPR Compliance in Web Application and API Security?](https://probely.com/blog/what-is-gdpr-compliance-in-web-application-and-api-security/)

View File

@@ -1 +1,15 @@
# Git and GitHub
# Git and GitHub
**Git** is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.
**GitHub** is a web-based platform that provides hosting for software development and version control using Git. It is widely used by developers and organizations around the world to manage and collaborate on software projects.
Visit the following resources to learn more:
- [@roadmap@Visit Dedicated Git & GitHub Roadmap](https://roadmap.sh/git-github)
- [@official@Git Documentation](https://git-scm.com/)
- [@official@GitHub Documentation](https://docs.github.com/en/get-started/quickstart)
- [@article@Learn Git with Tutorials, News and Tips - Atlassian](https://www.atlassian.com/git)
- [@article@Git Cheat Sheet](https://cs.fyi/guide/git-cheatsheet)
- [@video@What is GitHub?](https://www.youtube.com/watch?v=w3jLJU7DT5E)
- [@video@Git & GitHub Crash Course For Beginners](https://www.youtube.com/watch?v=SWYqp7iY_Tc)

View File

@@ -1 +1,7 @@
# GitHub Actions
# GitHub Actions
GitHub Actions is a CI/CD tool integrated directly into GitHub, allowing developers to automate workflows, such as building, testing, and deploying code directly from their repositories. It uses YAML files to define workflows, which can be triggered by various events like pushes, pull requests, or on a schedule. GitHub Actions supports a wide range of actions and integrations, making it highly customizable for different project needs. It provides a marketplace with reusable workflows and actions contributed by the community. With its seamless integration with GitHub, developers can take advantage of features like matrix builds, secrets management, and environment-specific configurations to streamline and enhance their development and deployment processes.
Learn more from the following resources:
- [@official@GitHub Actions Documentation](https://docs.github.com/en/actions)

View File

@@ -1 +1,12 @@
# GitLab CI
# GitLab CI
GitLab offers a CI/CD service that can be used as a SaaS offering or self-managed using your own resources. You can use GitLab CI with any GitLab hosted repository, or any BitBucket Cloud or GitHub repository in the GitLab Premium self-managed, GitLab Premium SaaS and higher tiers.
Visit the following resources to learn more:
- [@official@GitLab](https://gitlab.com/)
- [@official@GitLab Documentation](https://docs.gitlab.com/)
- [@official@Get Started with GitLab CI](https://docs.gitlab.com/ee/ci/quick_start/)
- [@official@Learn GitLab Tutorials](https://docs.gitlab.com/ee/tutorials/)
- [@official@GitLab CI/CD Examples](https://docs.gitlab.com/ee/ci/examples/)
- [@feed@Explore top posts about GitLab](https://app.daily.dev/tags/gitlab?ref=roadmapsh)

View File

@@ -1 +1,12 @@
# Go
# Go
Go, also known as Golang, is a statically typed, compiled programming language designed by Google. It combines the efficiency of compiled languages with the ease of use of dynamically typed interpreted languages. Go features built-in concurrency support through goroutines and channels, making it well-suited for networked and multicore systems. It has a simple and clean syntax, fast compilation times, and efficient garbage collection. Go's standard library is comprehensive, reducing the need for external dependencies. The language emphasizes simplicity and readability, with features like implicit interfaces and a lack of inheritance. Go is particularly popular for building microservices, web servers, and distributed systems. Its performance, simplicity, and robust tooling make it a favored choice for cloud-native development, DevOps tools, and large-scale backend systems.
Visit the following resources to learn more:
- [@roadmap@Visit Dedicated Go Roadmap](https://roadmap.sh/golang)
- [@official@Go Reference Documentation](https://go.dev/doc/)
- [@article@Go by Example - annotated example programs](https://gobyexample.com/)
- [@article@Go, the Programming Language of the Cloud](https://thenewstack.io/go-the-programming-language-of-the-cloud/)
- [@video@Go Programming – Golang Course with Bonus Projects](https://www.youtube.com/watch?v=un6ZyFkqFKo)
- [@feed@Explore top posts about Golang](https://app.daily.dev/tags/golang?ref=roadmapsh)

View File

@@ -1 +1,11 @@
# Google BigQuery
# Google BigQuery
BigQuery is a managed, serverless data warehouse product by Google, offering scalable analysis over large quantities of data. It is a Platform as a Service (PaaS) that supports querying using a dialect of SQL. BigQuery is NoOps, meaning there is no infrastructure to manage and you don't need a database administrator. BigQuery lets you focus on analyzing data to find meaningful insights while using familiar SQL and built-in machine learning at unmatched price-performance.
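A short sketch with the Python client, assuming `pip install google-cloud-bigquery` and default GCP credentials; it queries one of Google's public datasets:

```python
from google.cloud import bigquery

client = bigquery.Client()  # no servers to provision or manage

sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(sql).result():
    print(row.name, row.total)
```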
Visit the following resources to learn more:
- [@official@BigQuery overview](https://cloud.google.com/bigquery/docs/introduction)
- [@official@From data warehouse to autonomous data and AI platform](https://cloud.google.com/bigquery)
- [@video@What is BigQuery?](https://www.youtube.com/watch?v=d3MDxC_iuaw)

View File

@@ -1 +1,9 @@
# Google Cloud GKE
# GKE - Google Kubernetes Engine
Google Kubernetes Engine (GKE) is a managed Kubernetes service provided by Google Cloud Platform. It allows organizations to deploy, manage, and scale containerized applications using Kubernetes orchestration. GKE automates cluster management tasks, including upgrades, scaling, and security patches, while providing integration with Google Cloud services. It offers features like auto-scaling, load balancing, and private clusters, enabling developers to focus on application development rather than infrastructure management.
Visit the following resources to learn more:
- [@official@GKE](https://cloud.google.com/kubernetes-engine)
- [@video@What is Google Kubernetes Engine (GKE)?](https://www.youtube.com/watch?v=Rl5M1CzgEH4)

View File

@@ -1 +1,10 @@
# Google Cloud Storage
# Google Cloud Storage
Google Cloud Storage (GCS) is a scalable, secure, and durable object storage service within Google Cloud Platform (GCP) designed for storing and retrieving unstructured data of any type or size. It allows users to store data in "buckets" and access it through APIs, web interfaces, or command-line tools for applications, backups, media hosting, and big data analytics. GCS offers different storage classes to optimize costs based on data access frequency, strong security with encryption, and high availability through redundant data storage across multiple locations.
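A brief sketch with the Python client, assuming `pip install google-cloud-storage`, default credentials, and an existing bucket (the bucket and object names are illustrative):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-example-bucket")

# Objects are stored in buckets as "blobs"; upload one and read it back.
blob = bucket.blob("reports/2024/summary.txt")
blob.upload_from_string("quarterly numbers...")
print(blob.download_as_text())
```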
Visit the following resources to learn more:
- [@official@Cloud Storage](https://cloud.google.com/storage)
- [@article@Google Cloud Storage](https://en.wikipedia.org/wiki/Google_Cloud_Storage)
- [@video@Cloud Storage in a minute](https://www.youtube.com/watch?v=wNOs3LlsH6k)

View File

@@ -1 +1,13 @@
# Google Deployment Mgr.
# Google Deployment Mgr.
Google Cloud Deployment Manager is an infrastructure deployment service that automates the creation and management of Google Cloud resources. It provides users with flexible template and configuration files to create deployments that have a variety of Google Cloud services, such as Cloud Storage, Compute Engine, and Cloud SQL, configured to work together.
Importantly, Google Deployment Manager will reach end of support on 31 December 2025. An alternative to this tool is **Google Infrastructure Manager**. Infrastructure Manager (Infra Manager) automates the deployment and management of Google Cloud infrastructure resources using Terraform. Infra Manager allows users to deploy programmatically to Google Cloud, allowing them to use this service rather than maintaining a different toolchain to work with Terraform on Google Cloud.
Visit the following resources to learn more:
- [@official@Infrastructure Manager Overview](https://cloud.google.com/infrastructure-manager/docs/overview)
- [@official@Google Cloud Deployment Manager documentation](https://cloud.google.com/deployment-manager/docs)

View File

@@ -1 +1,12 @@
# Graph
# Graph Databases
In a graph database, each node is a record and each arc is a relationship between two nodes. Graph databases are optimized to represent complex relationships with many foreign keys or many-to-many relationships.
Graph databases offer high performance for data models with complex relationships, such as a social network. They are relatively new and are not yet widely used; it might be more difficult to find development tools and resources. Many graph databases can only be accessed with REST APIs.
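To illustrate the node/arc model without any particular database, here is a pure-Python sketch of a tiny social graph and a two-hop traversal, the kind of relationship query graph databases optimize:

```python
follows = {("ada", "lin"), ("lin", "sam")}  # each arc relates two nodes

def followers_of(person: str) -> set[str]:
    return {a for (a, b) in follows if b == person}

def friends_of_friends(person: str) -> set[str]:
    direct = {b for (a, b) in follows if a == person}
    return {b for (a, b) in follows if a in direct} - direct - {person}

print(followers_of("sam"))        # {'lin'}
print(friends_of_friends("ada"))  # {'sam'}
```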
Visit the following resources to learn more:
- [@article@What is a Graph database?](https://aws.amazon.com/nosql/graph/)
- [@article@What is A Graph Database? A Beginner's Guide](https://www.datacamp.com/blog/what-is-a-graph-database)
- [@article@Graph database](https://en.wikipedia.org/wiki/Graph_database)
- [@video@Introduction to NoSQL](https://www.youtube.com/watch?v=qI_g07C_Q5I)

View File

@@ -1 +1,10 @@
# HBase
# HBase
HBase is a column-oriented NoSQL database management system that runs on top of the Hadoop Distributed File System (HDFS), a main component of Apache Hadoop. HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases. It is well suited for real-time data processing or random read/write access to large volumes of data. HBase applications are written in Java™, much like a typical Apache MapReduce application.
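From Python, HBase is commonly reached through its Thrift gateway with the `happybase` library; a minimal sketch, assuming a running Thrift server and an existing table (names are illustrative):

```python
import happybase

connection = happybase.Connection("localhost")
table = connection.table("users")

# Cells are addressed by column family and qualifier, e.g. b"info:name".
table.put(b"row-1", {b"info:name": b"Ada", b"info:city": b"London"})

row = table.row(b"row-1")
print(row[b"info:name"])  # b'Ada'
```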
Visit the following resources to learn more:
- [@official@Apache HBase](https://hbase.apache.org/)
- [@article@What is HBase?](https://www.ibm.com/think/topics/hbase)
- [@article@Apache HBase](https://en.wikipedia.org/wiki/Apache_HBase)

Some files were not shown because too many files have changed in this diff.