SQL joins versus graph traversal on a large scale

For this experiment, we used exactly the same data structures as before; the only difference was the amount of data.

In MySQL we had 1,000,000 records in the t_user table, and approximately 1,000,000 × 50 = 50,000,000 records in the t_user_friend table. We ran the same four queries against this data set (friends at depths 2, 3, 4, and 5).
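The queries themselves aren't reproduced in this excerpt, but their shape is easy to sketch. The following JDBC snippet shows what the depth-2 variant might look like; the connection settings, the starting user ID, and the t_user_friend(user_1, user_2) column names are illustrative assumptions, not the actual benchmark code. Each additional depth level adds one more self-join of t_user_friend.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FriendsAtDepthTwo {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection settings; adjust to your MySQL instance.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/friends_db", "user", "password");
             Statement stmt = conn.createStatement()) {

            // Friends of friends (depth 2) for user 1: t_user_friend is
            // self-joined once; every extra depth level adds another join.
            // Simplified: doesn't exclude the start user or direct friends.
            String sql =
                "SELECT COUNT(DISTINCT uf2.user_2) " +
                "FROM t_user_friend uf1 " +
                "JOIN t_user_friend uf2 ON uf1.user_2 = uf2.user_1 " +
                "WHERE uf1.user_1 = 1";

            try (ResultSet rs = stmt.executeQuery(sql)) {
                if (rs.next()) {
                    System.out.println("Friends at depth 2: " + rs.getLong(1));
                }
            }
        }
    }
}
```

At depth 5, this pattern joins the 50-million-row t_user_friend table to itself four more times; table 1.3 shows what that costs.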

Table 1.3 shows the collected results for the performance of SQL queries in this case.

Table 1.3. The execution times for multiple join queries using a MySQL database engine on a data set of 1 million users

Depth Execution time (seconds) for 1 million users Count result
2 0.016 ~2,500
3 30.267 ~125,000
4 1,543.505 ~600,000
5 Not finished ---

Comparing these results to the MySQL results for a data set of 1,000 users, you can see that the performance of the depth 2 query has stayed the same, which can be explained by the MySQL engine's efficient, index-based handling of single table joins. Queries at depths 3 and 4 (which use 3 and 4 join operations, respectively) perform much worse, by at least two orders of magnitude. The SQL query for all friends at depth 5 did not finish in the hour we ran the script.

Note

Storing the large amount of data required for these examples takes significant disk space. To generate the sample data and run the examples against it, you'll need in excess of 10 GB of disk space available.

These results clearly show that the MySQL relational database is optimized for single join queries, even on large data sets. The performance of multiple join queries on large data sets degrades significantly, to the point that some queries become unusable in practice (for example, friends at depth 5 for a data set of 1 million users).

Why are relational database queries so slow?

The results in table 1.3 are somewhat expected, given the way join operations work. As we discussed earlier, each join creates a Cartesian product of all potential combinations of rows, then filters out those that don't match the where clause. With 1 million users, the Cartesian product for 5 joins (equivalent to a query at depth 5) contains an enormous number of rows, far too many to enumerate. Filtering out all the records that don't match the query is so expensive that the SQL query at depth 5 never finishes in a reasonable time.
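To put rough numbers on this (back-of-the-envelope estimates, not figures from the experiment), take an average of k ≈ 50 friends per user and about 5 × 10⁷ rows in t_user_friend:

```latex
% Back-of-the-envelope estimates (assumptions, not measured values).
% Unconstrained Cartesian product of five copies of t_user_friend:
\left(5 \times 10^{7}\right)^{5} \approx 3.1 \times 10^{38} \ \text{row combinations}
% Index-pruned search space, with k friends per user and depth d:
k^{d} = 50^{5} \approx 3.1 \times 10^{8} \ \text{candidate paths from one user}
```

Even the optimistic, index-pruned estimate leaves hundreds of millions of candidate paths to generate and deduplicate for a single starting user, so it's unsurprising that the depth 5 query never completes.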

We repeated the same experiment with Neo4j traversals. We had 1 million nodes representing users, and approximately 50 million relationships stored in Neo4j. We ran the same four traversals as in the previous example, and we got the performance results in table 1.4.

Table 1.4. The execution times for graph traversal using Neo4j on a data set of 1 million users

Depth Execution time (seconds) for 1 million users Count result
2 0.01 ~2,500
3 0.168 ~110,000
4 1.359 ~600,000
5 2.132 ~800,000

As you can see, the thousandfold increase in data didn't significantly affect Neo4j's performance. The traversal does get slower as you increase the depth, but the main reason for that is the increased number of results that are returned. The performance slows linearly with the increase in result size, and it's predictable even with a larger data set and depth level. In addition, this is at least a hundred times better than the corresponding MySQL performance.
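The traversal side of the experiment can be sketched with Neo4j's embedded Traversal Framework. This is a minimal sketch, not the book's actual benchmark code: it assumes a Neo4j 3.x-era embedded API (names differ across versions), and the IS_FRIEND_OF relationship type is a placeholder for whatever the generated sample data uses.

```java
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.traversal.Evaluators;
import org.neo4j.graphdb.traversal.TraversalDescription;
import org.neo4j.graphdb.traversal.Uniqueness;

public class FriendsAtDepth {

    // Placeholder relationship type; the real name depends on the sample data.
    private static final RelationshipType IS_FRIEND_OF =
            RelationshipType.withName("IS_FRIEND_OF");

    /** Counts users exactly `depth` hops away from startUser. */
    public static long countFriendsAtDepth(GraphDatabaseService graphDb,
                                           Node startUser, int depth) {
        try (Transaction tx = graphDb.beginTx()) {
            TraversalDescription traversal = graphDb.traversalDescription()
                    .relationships(IS_FRIEND_OF, Direction.OUTGOING)
                    .evaluator(Evaluators.atDepth(depth)) // stop at the target depth
                    .uniqueness(Uniqueness.NODE_GLOBAL);  // visit each person at most once

            long count = 0;
            for (Node friend : traversal.traverse(startUser).nodes()) {
                count++;
            }
            tx.success();
            return count;
        }
    }
}
```

Note how nothing in this code depends on the total size of the graph: the traversal is described purely in terms of a start node, a relationship type, and a depth.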

The main reason for Neo4j's predictability is the localized nature of the graph traversal: irrespective of how many nodes and relationships are in the graph, the traversal visits only those that are connected to the starting node, according to the traversal rules. Remember, relational join operations conceptually compute the Cartesian product before discarding irrelevant results, so their performance degrades steeply as the data set grows. Neo4j, however, only visits nodes that are relevant to the traversal, so it's able to maintain predictable performance regardless of the total data set size. The more nodes the traversal has to visit, the slower the traversal, as you've seen while increasing the traversal depth. But this increase is linear in the number of results and is still independent of the total graph size.
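The same point can be stated as a rough cost model (a simplification for intuition, not an analysis from the source). With average degree k ≈ 50 and traversal depth d, the number of nodes a traversal touches is

```latex
% Nodes visited by a breadth-first expansion of average degree k to depth d:
\sum_{i=1}^{d} k^{i} \;\approx\; k^{d} \qquad (k \approx 50)
```

which is proportional to the number of results returned and independent of the total graph size. A join-based plan, by contrast, works against the full 50-million-row table at each step (mitigated, but not eliminated, by indexes), tying its cost to the overall data set size rather than to the neighborhood being explored.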
