DEV Community

Cover image for Inside SQL Joins
Balaji
Balaji

Posted on

Inside SQL Joins

SQL joins are fundamental to querying databases, allowing users to combine data from multiple tables based on specified conditions. Joins are categorized into two main types: logical joins and physical joins. Logical joins represent the conceptual way in which data from tables is combined, while physical joins refer to the actual implementation of these joins within database systems like RDS (Relational Database Service) or other SQL servers. In today's blog post, we'll unravel the mysteries of SQL joins.

Let's jump in!

Logical Join

There are various types of logical joins in SQL. The two most common are Inner join and Outer join. We use these joins when we need to retrieve data from tables.

Physical Join

Physical joins are implemented inside RDS. The user writes the query using a logical join and RDS uses a physical join to perform the join operations. There are different types of physical joins like
1. Nested Loop Join
2. Hash Join
3. Merge Join and so on

Nested Loop Join

This is a type of join where a smaller table with fewer records is selected and looped through the other table until a match is found. This type of join is available in MySQL, Postgres, and even SQL servers. However, it is not a scalable option for large tables. It is mainly used in cases where the join operator does not use equality.

For example,Geospatial Queries: When dealing with geographic data, you might want to find points within a certain distance of other points. This could involve comparing the distance between every combination of points, which could be achieved with a Nested Loop Join.

SELECT *
FROM cities
JOIN landmarks ON distance(cities.location, landmarks.location) < 100;
Enter fullscreen mode Exit fullscreen mode

Hash Join

Hash join is a method of executing a join using the hash table to find a match record. A hash table is created in memory. If there is a large amount of data and there is not enough memory to store it, then it is written to disk. Hash join is more efficient than Nested Loop join. During execution, RDS builds the in-memory hash table where the rows from the join table are stored using the join attribute as the key. After the execution, the server starts reading the rows from the other table and finds the corresponding row from hash table. This method is commonly used when the join operator uses equality.

Suppose you have an "Employee" table with employee details like ID, name, and department ID, and a "Department" table with department details like ID and name. You want to join these tables to get the department each employee belongs to

SELECT *
FROM Employee
JOIN Department ON Employee.department_id = Department.department_id;
Enter fullscreen mode Exit fullscreen mode

In this example the join condition is based on equality between columns, making it suitable for a hash join. This method is efficient, especially when dealing with large datasets, as it can quickly match records using the hash table. However, as with any join method, it's important to consider the size of the datasets and available memory to ensure optimal performance.

Merge Join

Merge Join is a method used in SQL query execution when the join condition employs an equality operator and both sides of the join are large. This technique relies on sorted data inputs. If there exists an index on the expressions used in the join column, it can be utilized to obtain the sorted data efficiently. However, if the server needs to sort the data explicitly, it's crucial to analyze the indexes and consider optimizing them for improved performance.

Example:
Consider a scenario involving a "Sales" table with sales transactions, including sale ID, customer ID, and sale amount, and a "Customers" table containing customer details like customer ID, name, and location.

SELECT *
FROM Sales
JOIN Customers ON Sales.customer_id = Customers.customer_id;
Enter fullscreen mode Exit fullscreen mode

In this case, both the "Sales" and "Customers" tables are substantial, and the join condition relies on the equality of the "customer_id" column. For an efficient merge join, both input tables need to be sorted by the join column ("customer_id"). If there's no existing index on the "customer_id" column, the server may need to perform additional sorting operations, which could impact performance.

To optimize the merge join, it's advisable to create or modify indexes on the "customer_id" column in both tables. Ensuring proper maintenance and optimization of these indexes can lead to significant improvements in query performance, particularly for queries frequently involving joins based on the "customer_id" column.

By leveraging indexes effectively and ensuring sorted data inputs, merge joins can efficiently handle joins between large tables with equality-based join conditions, contributing to enhanced query performance and overall system efficiency.

Aspect Nested Loop Join Hash Join Merge Join
Join Condition Non-equality Equality Equality
Input Data Size Small to Medium Medium to Large Large
Data Sorting Not required Not required Required
Memory Usage Low Moderate to High Moderate to High
Index Utilization Not a primary concern Beneficial Relies on indexes
Performance(large datasets) Slower Efficient Efficient
Scalability Less scalable Scalable Scalable
Typical Use Cases Small to medium-sized tables Large tables with equality joins Large tables with equality joins

Top comments (0)