DISTINCT vs. GROUP BY: Microsoft Access Speed Test (2023)

In certain situations, the DISTINCT and GROUP BY clauses can be used to generate identical results, but one is consistently faster than the other.

I've been doing a lot of research for my upcoming talk on Troubleshooting Query Performance. As part of that research, I came across the following post by Access guru Colin Riddington:

Mythbusters - SELECT DISTINCT vs GROUP BY (Access World Forums, posted by isladogs)

Hi all. I updated my Mythbusters: WHERE vs HAVING thread yesterday: https://www.access-programmers.co.uk/forums/threads/mythbusters-speed-comparison-tests-having-vs-where.302225 When I did so, @CJ_London reminded me of a previous request to test the following: We often see OP’s using aggregate...

Colin was responding to a request from user @CJ_London:

We often see OP's using aggregate queries without any aggregation rather than SELECT DISTINCT - do you have an example to compare a SELECT DISTINCT v GROUP BY with perhaps a criteria or two?
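
For readers who have not seen the two forms side by side, here is a minimal sketch of what "an aggregate query without any aggregation" looks like next to SELECT DISTINCT. It runs against an in-memory SQLite database through Python's sqlite3 module, not against Access, so it only illustrates that the two forms return the same rows; it says nothing about Access timings. The table name Postcodes comes from the test database, but the column names and sample rows are invented for illustration.

```python
import sqlite3

# Hypothetical stand-in data: only the table name comes from the test
# database; the columns and rows are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Postcodes (Area TEXT, District TEXT)")
conn.executemany(
    "INSERT INTO Postcodes (Area, District) VALUES (?, ?)",
    [("BS", "BS1"), ("BS", "BS1"), ("BS", "BS2"), ("TA", "TA1"), ("TA", "TA1")],
)

# Form 1: SELECT DISTINCT removes duplicate rows from the result.
distinct_sql = "SELECT DISTINCT Area, District FROM Postcodes WHERE Area = 'BS'"

# Form 2: an aggregate query that groups on every output column but
# applies no aggregate function (no COUNT, SUM, and so on).
group_by_sql = (
    "SELECT Area, District FROM Postcodes WHERE Area = 'BS' "
    "GROUP BY Area, District"
)

# Sort both results so the comparison does not depend on row order.
distinct_rows = sorted(conn.execute(distinct_sql).fetchall())
group_by_rows = sorted(conn.execute(group_by_sql).fetchall())

print(distinct_rows)
print(group_by_rows)
```

Both queries return one row per unique Area/District pair that matches the criteria, which is why the two forms are interchangeable in this kind of query.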

Colin was intrigued, so he devised a speed test to compare the two. Before releasing the results, though, he asked people to predict which clause would win. Colin's gut feeling, which matched my own and that of most respondents, was that the SELECT DISTINCT clause would be faster. It turned out the opposite was true.

GROUP BY was consistently faster than SELECT DISTINCT.

Read on to learn why.

Reproducing the Results

Before going too far, I wanted to run the test on my computer to see if I would get the same results.

After running the test several times, I found that GROUP BY was consistently about 5 - 10% faster than SELECT DISTINCT:

[Screenshots: results from four runs of the speed test]

The only test that did not bear this out was when I ran each query only once. I did this as part of my analysis, which I will get to later in the article. In fact, when I originally published this article, the final screenshot above was the only one that I included, thus undermining my entire set of findings.

[Tip of the hat to Karl Donaubauer, who notified me privately...and rather diplomatically...that I had just made an ass of myself by writing 1,000+ words about why GROUP BY was faster than DISTINCT when the only proof I posted showed the opposite.]

And that is why you always run performance tests many times: to reduce the impact of external variations during the test (such as other applications temporarily using the processor).
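
As a rough illustration of that advice, the sketch below times a workload several times and reports the median alongside the fastest and slowest runs, so a single unlucky run does not decide the outcome. It is not Colin's test harness. The function names and the placeholder workload are hypothetical; in a real test, `placeholder_query` would be replaced by code that executes the query under test.

```python
import statistics
import time


def time_runs(run_query, repeats=10):
    """Call run_query() `repeats` times; return each elapsed time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce a list of timings to figures that are robust to outliers."""
    return {
        "runs": len(timings),
        "median": statistics.median(timings),
        "min": min(timings),
        "max": max(timings),
    }


def placeholder_query():
    # Hypothetical stand-in for "run the query once".
    sum(range(100_000))


timings = time_runs(placeholder_query, repeats=10)
summary = summarize(timings)
print(summary)
```

Comparing medians from many runs of each query is less sensitive to a background process briefly hogging the processor than comparing two single runs.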

Why is GROUP BY faster than SELECT DISTINCT?

Colin posted the JetShowPlan showplan.out files for the two queries he tested:

[Screenshot: JetShowPlan showplan.out output for the two queries]

Colin noted that the results were "almost identical."

The one key difference was in the final line of each output.

  • SELECT DISTINCT: store result in temporary table
  • GROUP BY: 01) Group table 'Postcodes'

Colin concluded his analysis with this final observation:

I'm also surprised by the result & I don't have an explanation for it other than the evidence from JET ShowPlan.

Testing Colin's Theory

Luckily for me, Colin uploaded his testing database, so it was easy for me to pick up where he left off.

If I have seen further it is by standing on the shoulders of giants.
- Sir Isaac Newton

Based on Colin's consistent results, I hypothesized that the difference in the two results was likely based on the fact that the slower clause was writing its results to a "temporary table" while the faster one was not. But how could I put that to the test?

ProcMon to the Rescue

I broke out my trusty friend, Process Monitor, and set up the following filter:

  • INCLUDE: Process Name is msaccess.exe
  • INCLUDE: Operation is WriteFile

[Screenshot: the Process Monitor filter]
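
The same two conditions can also be applied after the fact to a capture that has been saved as CSV. The sketch below is only an illustration of the filter logic: the sample rows are invented, the folder in the paths is hypothetical, and the column names ("Time of Day", "Process Name", "Operation", "Path", "Detail") are an assumption about the export layout that should be checked against a real export.

```python
import csv
import io

# Invented sample rows in the assumed shape of a Process Monitor CSV export.
SAMPLE_CSV = r'''"Time of Day","Process Name","Operation","Path","Detail"
"10:53:30.1000000 PM","msaccess.exe","ReadFile","C:\Temp\example.accdb","Offset: 0, Length: 4,096"
"10:53:30.2000000 PM","msaccess.exe","WriteFile","C:\Temp\JETF1B2.tmp","Offset: 16,383, Length: 1"
"10:53:30.3000000 PM","explorer.exe","WriteFile","C:\Temp\other.dat","Offset: 0, Length: 512"
"10:53:30.4000000 PM","msaccess.exe","WriteFile","C:\Temp\example.accdb","Offset: 0, Length: 4,096"
'''


def filter_rows(rows, process_name="msaccess.exe", operation="WriteFile"):
    """Keep rows that match BOTH conditions, like the two INCLUDE rules."""
    return [
        row
        for row in rows
        if row["Process Name"].lower() == process_name.lower()
        and row["Operation"] == operation
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
matches = filter_rows(rows)

for row in matches:
    print(row["Time of Day"], row["Path"], row["Detail"])
```

Only the two msaccess.exe WriteFile rows survive; the ReadFile row and the row from the other process are dropped.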

I then ran each of Colin's tests with a single loop to keep the ProcMon output small. Even so, the run still produced a lot of data (such is the nature of ProcMon). However, I found the results quite compelling (if a bit difficult to decipher for the uninitiated):

[Screenshot: color-coded Process Monitor results for the two queries]

Deciphering the ProcMon Results

I added coloring to the above screenshot to highlight a few key sections:

  • Pink: SELECT DISTINCT clause (store result in temporary table)
  • Blue: GROUP BY clause (01) Group table 'Postcodes')
  • Gold: the temporary file created to hold the "temporary table"
  • Green: the 0.1 second delay immediately prior to the first WriteFile operation on the temporary file

The gold and green sections could use a bit more detail. Let's discuss those further now.

Gold: The Temporary File

One of the things I love about ProcMon is getting to see how the sausage is made at the lowest levels of the operating system.

These are some of the abstractions that Joel Spolsky wrote about in his foundational article, "The Law of Leaky Abstractions." For instance, writing a file in VBA can be as little as one line of code. At the operating system level, though, the bits are not moved in a single operation. Rather, they are written to the destination file in chunks, based on the block size of the disk. In this case, the block size is 4,096 bytes (4K).

If you look closely, you can see that 147,456 bytes are stored in the JETF1B2.tmp file. Interestingly, the first byte written is actually the last byte of the fourth block of the file (Offset: 16,383, Length: 1). Then, bytes 0 through 16,383 are written to the file (i.e., the first four 4K blocks).

After that, the next byte written is the last byte in the 36th and final block of the file (Offset: 147,455, Length: 1). Then bytes 16,384 through 147,455 are written to the file (though several blocks appear to have been skipped, as there is a bigger-than-4K gap between Offsets 73,728 and 86,016).

My theory here is that the single-byte WriteFile operations are a way to force the disk to allocate several blocks at once (presumably, in the hopes that those blocks would be physically contiguous on the disk). But I digress.
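
The block arithmetic behind those offsets is easy to check. The sketch below uses only the figures quoted above (a 4,096-byte block size, a 147,456-byte file, and the offsets 16,383, 147,455, 73,728, and 86,016); the helper names are my own.

```python
BLOCK_SIZE = 4096  # bytes per block, as stated above


def block_number(offset):
    """Return the 1-based block that contains the byte at `offset`."""
    return offset // BLOCK_SIZE + 1


def is_last_byte_of_block(offset):
    """True when `offset` is the final byte of its block."""
    return (offset + 1) % BLOCK_SIZE == 0


file_size = 147_456
total_blocks = file_size // BLOCK_SIZE

# The two single-byte writes land on the last byte of block 4 and block 36.
first_probe = (block_number(16_383), is_last_byte_of_block(16_383))
second_probe = (block_number(147_455), is_last_byte_of_block(147_455))

# Distance between the two offsets around the gap, measured in blocks.
gap_in_blocks = (86_016 - 73_728) // BLOCK_SIZE

print(total_blocks)   # 36
print(first_probe)    # (4, True)
print(second_probe)   # (36, True)
print(gap_in_blocks)  # 3
```

Offsets are zero-based, so a file of 147,456 bytes ends at offset 147,455, and the first four blocks cover offsets 0 through 16,383.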

The key takeaway here is the fact that this .tmp file exists at all.

Since there is no analogous temporary file created for the GROUP BY clause, I believe this likely explains the consistent performance difference between the two clauses.

Green: File Creation Overhead

The total time to write the .tmp file is only 0.012 seconds.

   10:53:30.2340579 PM
 - 10:53:30.2218845 PM
 ---------------------
   00:00:00.0121734 seconds

This alone does not seem to be enough to account for the 5 to 10% difference between the GROUP BY and SELECT DISTINCT clauses from my testing. This leads me to believe that the bulk of the difference, at least in this case, comes from some other overhead involved in creating the file.

If we look at the first WriteFile operation for the .tmp file and compare it to the previous WriteFile operation for the front-end .accdb database, we find a difference of 0.12 seconds.

   10:53:30.2218845 PM
 - 10:53:30.1062746 PM
 ---------------------
   00:00:00.1156099 seconds
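
Both subtractions can be reproduced with a few lines of Python. This is a minimal sketch with helper names of my own. It converts each timestamp to whole 100-nanosecond ticks rather than using datetime, because the timestamps carry seven decimal places and datetime only keeps six. It assumes both timestamps fall on the same day and that the end is not earlier than the start.

```python
TICKS_PER_SECOND = 10_000_000  # seven decimal places = 100 ns resolution


def to_ticks(time_of_day):
    """Convert 'H:MM:SS.fffffff AM/PM' to 100 ns ticks since midnight."""
    clock, meridiem = time_of_day.split()
    hours, minutes, seconds = clock.split(":")
    whole, fraction = seconds.split(".")
    # 12 AM is hour 0 and 12 PM is hour 12 on a 24-hour clock.
    hour = int(hours) % 12 + (12 if meridiem.upper() == "PM" else 0)
    total_seconds = (hour * 60 + int(minutes)) * 60 + int(whole)
    return total_seconds * TICKS_PER_SECOND + int(fraction.ljust(7, "0")[:7])


def elapsed_seconds(start, end):
    """Return end - start as a decimal string with seven decimal places."""
    ticks = to_ticks(end) - to_ticks(start)
    return f"{ticks // TICKS_PER_SECOND}.{ticks % TICKS_PER_SECOND:07d}"


# Time spent writing the .tmp file (first to last WriteFile on it).
write_time = elapsed_seconds("10:53:30.2218845 PM", "10:53:30.2340579 PM")

# Delay between the previous .accdb write and the first .tmp write.
delay = elapsed_seconds("10:53:30.1062746 PM", "10:53:30.2218845 PM")

print(write_time)  # 0.0121734
print(delay)       # 0.1156099
```

The delay before the first write to the temporary file is roughly ten times longer than the writes themselves.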

I filtered out other operations from these results to maximize the signal-to-noise ratio, so other work may be happening inside that window. The initial overhead required to create the file is therefore likely less than 0.12 seconds. Still, I think it could be contributing as much as the writes themselves, or more, to the overall speed difference between GROUP BY and DISTINCT.

Final Thoughts

The GROUP BY clause is faster than the SELECT DISTINCT clause (at least in these tests) because it does not require writing to a temporary file on disk.

While I would hesitate to say that "GROUP BY is faster than SELECT DISTINCT" in every situation, I do feel confident in saying that we can now account for the performance difference between these two clauses (at least in this scenario).

Also, as the difference seems to be based on whether a file is being saved to disk, I would expect the difference to be more pronounced on machines with slow spinning hard drives (5400 RPM) versus those with high-speed SSDs (such as my test rig).

In summary:

  • GROUP BY is slightly faster than SELECT DISTINCT
  • The slower the drive, the bigger the difference
  • This may not be a universal rule (we tested simple cases only)
  • The difference is hardly noticeable to users
  • These tests were done on Access back-end database tables and may not hold for other types of data sources (e.g., SQL Server linked tables)

My final advice:

  • All else being equal, use GROUP BY for new development.
  • Don't sacrifice readability for the small performance gain.
  • Don't change existing queries (it's not worth it).

External references

Mythbusters - SELECT DISTINCT vs GROUP BY (Access World Forums, posted by isladogs)

Special thanks to Colin Riddington (isladogs) for putting in the hard work of creating a test harness and then publishing it for the world.

Referenced articles

JetShowPlan: A Primer (No Longer Set, Mike Wolfe): You may be familiar with JetShowPlan, but I guarantee you’ve never read an article about it quite like this one.

Image by Maïlys Jans from Pixabay

UPDATE [2022-03-16]: Added section "Reproducing the Results" and posted additional screenshots to support my overall findings (h/t Karl Donaubauer).

UPDATE [2022-03-16]: Modified my "final advice" to include a note about not sacrificing readability for the slight performance gain (h/t Joakim Dalby).

UPDATE [2022-03-16]: Clarified that these results apply to Access tables, not necessarily other types of tables, such as linked SQL Server tables (h/t AHeyne).
