SQL joins versus graph traversal on a large scale
For this experiment, we used exactly the same data structures as before; the only difference was the
amount of data.
In MySQL we had 1,000,000 records in the t_user table, and approximately 1,000,000 × 50 =
50,000,000 records in the t_user_friend table. We ran the same four queries against this data set (friends
at depths 2, 3, 4, and 5).
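To make the shape of these queries concrete, here is a minimal sketch of how the depth-3 variant might be issued over JDBC. It is illustrative rather than the exact listing from the earlier section: the column names (t_user.id, user_1, user_2), the connection URL, and the credentials are assumptions, and the MySQL Connector/J driver is assumed to be on the classpath. The point is the pattern of one additional join per level of depth.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FriendsAtDepth3 {

    // Depth 3 uses three joins: t_user to t_user_friend, then two more
    // self-joins on t_user_friend. Column names are assumed here.
    private static final String DEPTH_3_SQL =
        "SELECT COUNT(DISTINCT uf3.user_2) " +
        "FROM t_user u " +
        "JOIN t_user_friend uf1 ON u.id = uf1.user_1 " +
        "JOIN t_user_friend uf2 ON uf1.user_2 = uf2.user_1 " +
        "JOIN t_user_friend uf3 ON uf2.user_2 = uf3.user_1 " +
        "WHERE u.id = ?";

    public static void main(String[] args) throws Exception {
        // Illustrative connection settings; adjust to your local MySQL instance.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/friends", "user", "password");
             PreparedStatement ps = conn.prepareStatement(DEPTH_3_SQL)) {

            ps.setLong(1, 1L); // start from an arbitrary user id
            long start = System.currentTimeMillis();
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                System.out.println("Friends at depth 3: " + rs.getLong(1));
            }
            System.out.println("Took " + (System.currentTimeMillis() - start) + " ms");
        }
    }
}
```

The depth-2, depth-4, and depth-5 variants follow the same pattern, with one fewer or one or two more self-joins on t_user_friend.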
Table 1.3 shows the collected results for the performance of SQL queries in this
case.
Table 1.3. The execution times for multiple join queries using a MySQL database engine on a data set of 1 million users
| Depth | Execution time (seconds) for 1 million users | Count result |
|---|---|---|
| 2 | 0.016 | ~2,500 |
| 3 | 30.267 | ~125,000 |
| 4 | 1,543.505 | ~600,000 |
| 5 | Not finished | --- |
Comparing these results to the MySQL results for a data set of 1,000 users, you can see that the
performance of the depth 2 query has stayed the same, which can be explained by the MySQL engine's
design: it handles table joins efficiently by using indexes. Queries at depths 3 and 4 (which use 3
and 4 join operations, respectively) demonstrate much worse results, by at least two orders of
magnitude. The SQL query for all friends at depth 5 did not finish in the hour we ran the script.
Note
These examples require a significant amount of disk space to store their data. To generate the
sample data and run the examples against it, you’ll need in excess of 10 GB of
disk space available.
These results clearly show that the MySQL relational database is optimized for single join queries,
even on large data sets. The performance of multiple join queries on large data sets degrades
significantly, to the point that some queries cannot complete in any practical amount of time (for example, friends at depth 5 for
a data set of 1 million users).
Why are relational database queries so slow?
The results in table 1.3 are somewhat expected, given the way join operations work. As we discussed
earlier, each join creates a Cartesian product of all potential combinations of rows and then filters out
those that don’t match the where clause. With 1 million users, the Cartesian product of 5 joins
(equivalent to a query at depth 5) contains an enormous number of rows, with far too many zeros to be
readable. Filtering out all the records that don’t match the query is so expensive that the SQL
query at depth 5 never finishes in a reasonable time.
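To put a rough number on that, take the conceptual model literally: the unfiltered cross product of t_user (10^6 rows) with five copies of t_user_friend (5 × 10^7 rows each) would contain 10^6 × (5 × 10^7)^5, or roughly 3 × 10^44, candidate rows. No real query engine materializes anything close to that, of course, but the back-of-the-envelope figure illustrates why filtering after the fact becomes hopeless as joins are added.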
We repeated the same experiment with Neo4j traversals. We had 1 million nodes representing users, and
approximately 50 million relationships stored in Neo4j. We ran the same four traversals as in the
previous example, and we got the performance results in table 1.4.
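For comparison, the traversal side of the experiment can be expressed with Neo4j’s embedded Java Traversal Framework. The sketch below is illustrative rather than the exact code used for these measurements: the relationship type name (IS_FRIEND_OF), the traversal direction, and the Neo4j 2.x/3.x-style API are all assumptions.

```java
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.traversal.Evaluators;
import org.neo4j.graphdb.traversal.TraversalDescription;
import org.neo4j.graphdb.traversal.Uniqueness;

public class FriendTraversal {

    // Assumed relationship type connecting user nodes.
    private enum RelTypes implements RelationshipType { IS_FRIEND_OF }

    /** Counts the users reachable from startUser at exactly the given depth. */
    public static long countFriendsAtDepth(GraphDatabaseService graphDb,
                                           Node startUser, int depth) {
        try (Transaction tx = graphDb.beginTx()) {
            TraversalDescription friends = graphDb.traversalDescription()
                    .relationships(RelTypes.IS_FRIEND_OF, Direction.OUTGOING) // direction is an assumption
                    .evaluator(Evaluators.atDepth(depth))   // return only nodes at the target depth
                    .uniqueness(Uniqueness.NODE_GLOBAL);    // count each user at most once

            long count = 0;
            for (Node friend : friends.traverse(startUser).nodes()) {
                count++;
            }
            tx.success();
            return count;
        }
    }
}
```

The four traversals correspond to calling this method with depth values 2 through 5.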
Table 1.4. The execution times for graph traversal using Neo4j on a data set of 1 million users
| Depth | Execution time (seconds) for 1 million users | Count result |
|---|---|---|
| 2 | 0.01 | ~2,500 |
| 3 | 0.168 | ~110,000 |
| 4 | 1.359 | ~600,000 |
| 5 | 2.132 | ~800,000 |
As you can see, the increase in data by a thousand times didn’t significantly affect Neo4j’s performance.
The traversal does get slower as you increase the depth, but the main reason for that is the increased
number of results that are returned. The performance slows linearly with the increase in result size,
and it remains predictable even with a larger data set and greater depth. In addition, this is at least a
hundred times better than the corresponding MySQL performance.
The main reason for Neo4j’s predictability is the localized nature of the graph traversal; irrespective of
how many nodes and relationships are in the graph, the traversal will only visit nodes that are connected
to the starting node, according to the traversal rules. Remember, relational join operations compute
the Cartesian product before discarding irrelevant results, which degrades performance exponentially as
the data set grows. Neo4j, however, only visits nodes that are relevant to the traversal, so it’s able to
maintain predictable performance regardless of the total data set size. The more nodes the traversal has
to visit, the slower the traversal, as you’ve seen while increasing the traversal depth. But this increase is
linear and is still independent of the total graph size.
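To make the contrast concrete: with roughly 50 friends per user, a depth-2 traversal expands about 50 relationships from the starting node and then about 50 more from each friend, on the order of 50 × 50 = 2,500 nodes in total, which lines up with the count result in table 1.4. That work is the same whether the graph contains 1,000 users or 1 million, because the other users are simply never visited.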