Friday, July 5, 2013

proc summary - BY statement versus CLASS statement

So, on the way to writing some impossible code, I ran into some equally strange results, produced by code and by false assumptions that I have relied on for years.

Mitch and John
As our group works with larger and larger datasets, it keeps getting more important to "worry" about runtimes.
The original problem looked intractable. Trip counts are non-additive (you cannot just add up counts from subtotals), and computing them for 150 million possible combinations of dimensions, over 100 million trips, is something new.
I tried PROC SQL in SAS and failed. Even in a real database, the memory needed is on the order of 5 TB! Splitting the input into 100 chunks cuts that requirement to 50 GB per chunk, which is tractable.
Mitch, this means that we have reached the edge again. Non-additive measures at this scale are right on the edge of what is possible.
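For what it is worth, here is a minimal sketch of what that chunking step might look like. The dataset and variable names (all_trips, trip_id) are made up for illustration, not the production names:

/* Hedged sketch: assign each trip to one of 100 chunks so that all   */
/* records for a given trip land in the same chunk - required because */
/* the measure (trip counts) is non-additive across chunks.           */
data work.trips_chunked ;
      set work.all_trips ;              /* hypothetical input name    */
      part = mod( trip_id, 100 ) ;      /* chunk number, 0 through 99 */
run ;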


TL;DR (aka short answer)
Using a BY statement does not mean that your code will run faster.
Sometimes a CLASS statement is the only proper answer, especially if the summary fits in memory.

Long Answer

Version A:

proc summary data=xtrips_01_sort_V  missing  nway  threads ;
      /* BY requires the input to already be sorted by these variables */
      by n_market  n_products  n_time  n_demos  n_trips ;
      var prj_factor  dollars_A2  dollars_A4  units_A4 ;
      /* drop=_: removes the automatic _TYPE_ and _FREQ_ variables */
      output out=perm.xtrips_03_sort_&part. ( drop=_: )
             n(prj_factor)=trips_A1
             sum=trips_A2  dollars_A2  dollars_A4  units_A4 ;
      format _numeric_ comma15. ;
run ;

Version B:

proc summary data=xtrips_01_sort_V  missing  nway  threads ;
      /* CLASS builds the groups in memory, so no pre-sorting is needed; */
      /* nway keeps only the full crossing of all five CLASS variables   */
      class n_market  n_products  n_time  n_demos  n_trips ;
      var prj_factor  dollars_A2  dollars_A4  units_A4 ;
      output out=perm.xtrips_03_sort_&part. ( drop=_: )
             n(prj_factor)=trips_A1
             sum=trips_A2  dollars_A2  dollars_A4  units_A4 ;
      format _numeric_ comma15. ;
run ;


Facts
Input file
270 million records
35 GB of SAS compressed data (which somehow uses more pages than uncompressed? So clearly I should not have turned compression on, LOL)
Sorted! The dataset was created by a previous proc summary with a CLASS statement, so it is SORTED by the same set of variables (plus a couple more at the end). You can verify the sort flag yourself; see the sketch just below.
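A quick sanity check, using the dataset name from the code above:

proc contents data=xtrips_01_sort_V ;   /* the Sort Information section shows the Sortedby variables */
run ;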

Output file
20 million records
2 GB (gah, TURN COMPRESSION OFF!!! If you have no character strings, compression isn't that useful!!)
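Turning it off is a one-line system option, or a dataset option on the output. And if you do want compression on mostly numeric data, COMPRESS=BINARY is the variant aimed at numeric values; worth benchmarking either way. A sketch (the copy step is only there to illustrate the option):

options compress=no ;    /* session-wide default: no compression */

data work.uncompressed_copy ( compress=no ) ;    /* or set it per dataset */
      set perm.xtrips_03_sort_&part. ;
run ;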

Assumptions
1) I have to run this code 100 times. It is critical that it not take an insane amount of time.
2) Given that the input dataset is sorted, we "know" that using the BY statement will be faster.
3) We "know" that using the CLASS statement will use more memory, because it has to hold all the results before writing the file out.
4) So I assumed that Version A would run faster than Version B, and probably a LOT faster. (Testing this is cheap; see the sketch after this list.)
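
The cheap test: with FULLSTIMER turned on, the log reports real time, CPU time, and memory used for every step, so assumptions like 2) and 3) can be checked instead of trusted.

options fullstimer      /* log notes now include memory used and full timing stats per step */
        msglevel=i ;    /* extra informational notes in the log                             */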

Results and the wakeup call
1) Version A was on pace to take 10-20 hours. I killed it long before it finished.
2) Version B took 10 minutes. Read that again. 10 minutes.
3) Yes, Version B took 4 GB of memory to run, but so what? We have tons of memory. (And it is easy to check how much; see below.)
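
About "tons of memory": the session ceiling is the MEMSIZE system option, and you can report it before committing to a CLASS statement. A sketch (MEMSIZE itself is fixed at SAS start-up; this only reads it):

proc options option=memsize ;    /* prints the current memory ceiling to the log */
run ;

%put NOTE: memsize is %sysfunc(getoption(memsize)) ;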

Conclusions
1) Do not assume that you need to sort the data before using proc summary.
2) If your output dataset from proc summary can "fit" in memory, use a CLASS statement without sorting first.

You think I am kidding?
Here is another example:

Input file:   503,012,176 records (unsorted data; sorting takes about 70 minutes on its own!)
Output file:  269,648,824 records, 35 GB!
Memory used:  45 GB !!!
Time:         60 minutes
Here is the insane part: I have to run this process 100 times.
I am looking at 100-150 hours of total running time.
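
Since &part. already drives the dataset names, the natural way to run all 100 chunks is a driver macro around the Version B step. A minimal sketch; the per-part input name xtrips_01_sort_&part. is my assumption here, not the production name:

%macro run_all_parts ;
      %local part ;
      %do part = 1 %to 100 ;          /* one summary pass per chunk */
            proc summary data=xtrips_01_sort_&part.  missing  nway  threads ;
                  class n_market  n_products  n_time  n_demos  n_trips ;
                  var prj_factor  dollars_A2  dollars_A4  units_A4 ;
                  output out=perm.xtrips_03_sort_&part. ( drop=_: )
                         n(prj_factor)=trips_A1
                         sum=trips_A2  dollars_A2  dollars_A4  units_A4 ;
            run ;
      %end ;
%mend run_all_parts ;

%run_all_parts ;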

How much time are you wasting in your jobs?
How many times do you run summary in how many jobs over how many days over how many years?
If you do it a lot, maybe you need to invest some time to save some time.
To be fair, I am pretty certain the results are reversed, or simply irrelevant, for small/tiny datasets.
Spend an hour testing on your longest-running proc sort and proc summary.
