Friday, July 19, 2013

Bigotry

South Park did an AMAZING episode a few years back. About racism. Token kept saying that you don't get it. And in the end, Stan said "I finally get it. I DON'T GET IT."

I will never be able to understand what a gay, black female has gone through. EVER. No amount of analogy and discussion and actively seeing bigotry can ever get you to understand.

I can actively try to understand; I can listen and watch. I can SPOT when people are being bigots.
But I will never understand.

Thursday, July 11, 2013

Scalzi - the evolution of thoughts on harassment

This first article was the START of a series of postings on harassment.
Scalzi was replying to THIS event: http://blog.bcholmes.org/the-readercon-thing/

It became CLEAR that this is something Scalzi has been thinking about and talking about.

Which led to THIS,
which, if NOTHING ELSE was ever said, is just an AMAZING thing.
Small, but a start.
But nothing with Scalzi ends up that way;
the follow-up questions and answers are always amazing.
But it isn't over yet.
He is TAKING NAMES!!

Scalzi on SWM = easy mode

SWM (Straight White Male) = easy mode

http://whatever.scalzi.com/2012/05/15/straight-white-male-the-lowest-difficulty-setting-there-is/



Friday, July 5, 2013

proc summary - BY statement versus CLASS statement

So on the way to writing some impossible code, I ran into equally strange results, driven by code and false assumptions which I have relied on for years now.

Mitch and John:
As our group works with larger datasets more and more often, it continues to be important to "worry" about runtimes.
The original problem was intractable. The number of trips is a non-additive measure. Computing the number of trips for 150 million possible combinations of dimensions, across 100 million trips, is something new.
I tried PROC SQL in SAS and failed. Even in a real database, the amount of memory needed is on the order of 5TB! Splitting the input into 100 chunks reduces that requirement to only 50GB of memory, which is tractable (a sketch of the chunked run is below).
Mitch, this means that we have reached the edge again. Non-additive measures in this space are right on the edge of possible.
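
(For the curious, the chunked run looks roughly like the sketch below. This is a minimal illustration, not the production code: the macro name, the chunked dataset names, and the part count are made up.)

%macro summarize_parts(nparts=100);
  %local part;
  %do part = 1 %to &nparts.;
    /* one CLASS-based summary per chunk, so each summary fits in memory */
    proc summary data=xtrips_01_part_&part. missing nway threads;
      class n_market n_products n_time n_demos n_trips;
      var prj_factor dollars_A2 dollars_A4 units_A4;
      output out=perm.xtrips_03_part_&part. (drop=_:)
             n(prj_factor)=trips_A1 sum=trips_A2 dollars_A2 dollars_A4 units_A4;
    run;
  %end;
%mend summarize_parts;

%summarize_parts(nparts=100)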


TL;DR (aka short answer)
Using a BY statement does not mean that your code will run faster.
Sometimes a CLASS statement is the only proper answer, especially if the output fits in memory.

Long Answer

Version A:

proc summary data=xtrips_01_sort_V missing nway threads;
      /* BY statement: requires the input to already be sorted by these variables */
      by n_market n_products n_time n_demos n_trips;
      var prj_factor dollars_A2 dollars_A4 units_A4;
      output out=perm.xtrips_03_sort_&part. (drop=_:)
             n(prj_factor)=trips_A1 sum=trips_A2 dollars_A2 dollars_A4 units_A4;
      format _numeric_ comma15.;
run;

Version B:

proc summary data=xtrips_01_sort_V missing nway threads;
      /* CLASS statement: input order does not matter; groups are built in memory */
      class n_market n_products n_time n_demos n_trips;
      var prj_factor dollars_A2 dollars_A4 units_A4;
      output out=perm.xtrips_03_sort_&part. (drop=_:)
             n(prj_factor)=trips_A1 sum=trips_A2 dollars_A2 dollars_A4 units_A4;
      format _numeric_ comma15.;
run;
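
(If you do not trust that the two versions produce the same numbers, a quick PROC COMPARE will settle it. A sketch, assuming the two versions were written to differently named output datasets:)

/* Both outputs are in order by the grouping variables (BY output inherits
   the sorted input order; CLASS output comes out sorted), so an ID-based
   compare works directly. The output dataset names here are hypothetical. */
proc compare base=perm.xtrips_03_version_a compare=perm.xtrips_03_version_b;
  id n_market n_products n_time n_demos n_trips;
run;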


Facts
Input file:
270 million records
35GB of SAS compressed data (which uses more pages than uncompressed? So clearly I should not have turned compression on LOL)
Sorted! The dataset was created by a previous proc summary with a CLASS statement, so it is SORTED by the same set of variables (plus a couple more at the end). (A quick way to check the sort flag is sketched after this list.)

Output file:
20 million records
2GB (gah, TURN COMPRESSION OFF!!! If you have no character strings, compression isn't that useful!!)
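
(About that "Sorted!" claim: you can check what sort order, if any, SAS has recorded in a dataset's header with PROC CONTENTS; look for the Sortedby entry in the output. Fair warning, the flag is not always set even when the rows are physically in order, so treat a missing flag as "unknown", not "unsorted".)

proc contents data=xtrips_01_sort_V;   /* check the Sortedby entry in the output */
run;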

Assumptions
1) I have to run this code 100 times. It is critical that it not take an insane amount of time.
2) Given that the input dataset is sorted, we "know" that using the BY statement will be faster.
3) We "know" that using the CLASS statement will use more memory, because it has to store all the results before writing the file out.
4) So I assumed that Version A would run faster than Version B, and probably a LOT faster.

Results and the wakeup call
1) Version A was on pace to take 10-20 hours. I killed it long before it finished.
2) Version B took 10 minutes. Read that again. 10 minutes.
3) Yes, Version B took 4GB of memory to run, but so what? We have tons of memory.
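
(How do I know Version B took 4GB? FULLSTIMER. Turn it on and the log reports real time, CPU time, and memory used for every step, which makes this kind of comparison easy to measure:)

options fullstimer;   /* log memory and timing details for each step that follows */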

Conclusions
1) Do not assume that you need to sort the data before using proc summary.
2) If your output dataset from proc summary can "fit" in memory, use a CLASS statement without sorting first.
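
(One caveat on "fit in memory": SAS caps how much memory summarization procs can use via the SUMSIZE option, itself bounded by MEMSIZE, which can only be set when SAS starts, e.g. -memsize at invocation. A minimal sketch, assuming your site lets you change it:)

options sumsize=MAX;   /* let proc summary use memory up to the MEMSIZE cap */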

You think I am kidding?
Here is another example:

Input file: 503,012,176 records (unsorted data - sorting takes about 70 minutes on its own!)
Output file: 269,648,824 records - 35GB!
Memory used: 45GB!!!
Time: 60 minutes

Here is the insane part: I have to run this process 100 times.
I am looking at 100-150 hours of total running time.

How much time are you wasting in your jobs?
How many times do you run proc summary, in how many jobs, over how many days, over how many years?
If you do it a lot, maybe you need to invest some time to save some time.
Alas, I am pretty certain the results are reversed, or irrelevant, for small/tiny datasets.
Spend an hour testing on your longest-running proc sort and proc summary. (A minimal timing harness is sketched below.)
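
(Here is one way to run that test; dataset and variable names (mydata, a, b, c, x) are placeholders, and datetime() snapshots before and after each approach give you wall-clock seconds:)

/* Approach 1: sort, then summarize with BY. */
%let t0 = %sysfunc(datetime());
proc sort data=mydata out=mydata_sorted;
  by a b c;
run;
proc summary data=mydata_sorted missing nway;
  by a b c;
  var x;
  output out=out_by sum=;
run;
%put NOTE: sort + BY took %sysevalf(%sysfunc(datetime()) - &t0.) seconds.;

/* Approach 2: summarize with CLASS, no sort needed. */
%let t1 = %sysfunc(datetime());
proc summary data=mydata missing nway;
  class a b c;
  var x;
  output out=out_class sum=;
run;
%put NOTE: CLASS took %sysevalf(%sysfunc(datetime()) - &t1.) seconds.;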