Sunday, September 11, 2011

Statistical analysis and histogram plotting using gnuplot

Given a data file containing a set of values, count how many of them fall into each of the intervals [a1:a2], [a2:a3], ... and then plot the counts as a histogram. This is a common problem in statistics, and it is exactly what we will do in this article.

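First, let us see how to map the data onto intervals. The function "floor(x)" returns the largest integer not greater than its argument. So floor(x/dx)*dx maps x to the left edge of the interval of width dx that contains it, which is one of [-n*dx:-(n-1)*dx], [-(n-1)*dx:-(n-2)*dx], ..., [(n-1)*dx:n*dx]. Adding dx/2 to that value gives the center of the interval, which is where the box for that interval will be drawn.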
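As a small illustration of this mapping, the lines below can be typed at the gnuplot prompt. The interval width dx=0.5, the sample values, and the function name bin_left are assumptions chosen only for this sketch; they do not come from the data file used in the article.

```gnuplot
# Illustration only: dx and the sample values are made up for this sketch
dx = 0.5
bin_left(x) = dx*floor(x/dx)   # left edge of the interval that contains x

print bin_left(0.7)    # 0.7 lies in [0.5:1.0], so the left edge is 0.5
print bin_left(-0.2)   # floor rounds down, so -0.2 lies in [-0.5:0.0]
print bin_left(1.0)    # a value on a boundary goes to the interval above it
```

Because floor always rounds down, negative values are handled the same way as positive ones, and each interval includes its left edge but not its right edge.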

Next we count the number of data points in each interval. In gnuplot there is a smooth option called "frequency". It makes the data monotonic in x, and points with the same x-value are replaced by a single point whose y-value is the sum of theirs. If every data point is given the y-value 1 and its x-value is mapped to its interval, the summed y-value is exactly the number of data points in that interval.
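The effect of "smooth frequency" can be checked on a handful of numbers. In the sketch below the five inline values, the width dx=0.5, the function name bin, and the output file name counts.dat are all assumptions made for illustration. "set table" makes gnuplot write the resulting points to a file instead of drawing them.

```gnuplot
# Illustration only: five made-up values, typed inline after the plot command
dx = 0.5
bin(x) = dx*floor(x/dx) + dx/2.0     # center of the interval containing x

set table "counts.dat"               # hypothetical output file name
plot "-" using (bin($1)):(1.0) smooth frequency
0.1
0.2
0.7
0.8
0.9
e
unset table
```

Two of the values fall into [0.0:0.5] and three into [0.5:1.0], so counts.dat should list two points: x=0.25 with y=2 and x=0.75 with y=3.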

Finally, we plot the result using the boxes plot style.

The main idea has now been introduced, so it is time to write the plotting script.
reset
n=100 #number of intervals
max=3. #max value
min=-3. #min value
width=(max-min)/n #interval width
#function used to map a value to the center of its interval
hist(x,width)=width*floor(x/width)+width/2.0
set term png #output terminal and file
set output "histogram.png"
set xrange [min:max]
set yrange [0:]
#to put an empty boundary around the
#data inside an autoscaled graph.
#(some gnuplot versions reject the "graph" keyword, see the comments below)
set offset graph 0.05,0.05,0.05,0.0
set xtics min,(max-min)/5,max
set boxwidth width*0.9
set style fill solid 0.5 #fillstyle
set tics out nomirror
set xlabel "x"
set ylabel "Frequency"
#count and plot: every value adds 1.0 to its interval
plot "data.dat" u (hist($1,width)):(1.0) smooth freq w boxes lc rgb "green" notitle
Using a data file (download from here) which contains 10000 normally distributed random numbers, we get a graph like the following one.

[Figure: statistical histogram plotted using gnuplot]
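Several readers in the comments below ask how to fit a Gaussian curve to this histogram. One possible approach, following the "set table" and "fit" hints given there, is sketched here. It is not part of the original script: the file name hist_table.dat, the function name gauss, and the starting values of a, mu and sigma are assumptions and should be adapted to your own data.

```gnuplot
# Sketch only: assumes the article's data.dat is in the working directory
reset
n=100
max=3.
min=-3.
width=(max-min)/n
hist(x,width)=width*floor(x/width)+width/2.0

# Pass 1: write the binned counts to a file instead of drawing them.
# "hist_table.dat" is a hypothetical file name.
set table "hist_table.dat"
plot "data.dat" u (hist($1,width)):(1.0) smooth freq
unset table

# Pass 2: fit a Gaussian to the (interval center, count) pairs.
# a, mu and sigma are assumed starting guesses; adjust them to your data.
gauss(x)=a*exp(-(x-mu)**2/(2*sigma**2))
a=200.; mu=0.; sigma=1.
fit gauss(x) "hist_table.dat" u 1:2 via a,mu,sigma

# Draw the histogram and the fitted curve together.
set boxwidth width*0.9
set style fill solid 0.5
plot "data.dat" u (hist($1,width)):(1.0) smooth freq w boxes lc rgb "green" notitle, \
     gauss(x) lw 2 lc rgb "red" title "Gaussian fit"
```

The table pass is done without "w boxes" so that the first two columns of the file are simply the interval center and the count. The fit can fail to converge if the starting values are far from the data, so it is worth looking at the histogram first and choosing them accordingly.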

54 comments:

  1. hi, i tried the same thing using gnuplot but it says "undefined variable: graph"

    then i still continue with the plot and says "all points y value undefined"

    thanks.

    1. "all points y value undefined" means that all your y points are outside your yrange; you have to set the yrange so that it contains them.

  2. Hi, Callisto:
    1. The script runs well on my computer; I have just confirmed it. So the first problem may be caused by a typing mistake.
    2. "all points y value undefined" may be caused by gnuplot not being able to find the data file. Have you put the file data.dat in the working directory?

    You may copy the script to a file (for example, plot.gplt), and then copy it and the data file (data.dat) to your working directory. After that, run the command "load 'plot.gplt'" in gnuplot.

  3. i managed to figure it out, just had to remove the word "graph". :)

    Would you be able to tell me how to fit a gaussian curve onto the histogram? thank you.

  4. Hi, Callisto:
    It is a bit hard to fit a Gaussian curve in this problem using only gnuplot, since gnuplot is designed as a plotting tool, not as data processing software. With some tricks the goal may be achieved. Maybe I will talk about how to do it in a future post.
    For now I advise processing the data with data processing software first: obtain the fitted curve and then plot it on the graph.

    1. I'm surprised that you can create so many beautiful plots with Gnuplot using a lot of features, but you do not know the "fit" command.
      I see that this comment is quite old and most probably (if you looked after) you found already that fitting in Gnuplot is actually very simple.
      It is worth a try.

  5. Really cool thing! I never thought that gnuplot could do something like that and it's exactly what I wanted to do. Just a little question is it possible to fit a function (in this case a gaussian) to this histogram?
    In any case thanks a lot!

  6. Anonymous:
    It is possible to use "set table" to export the histogram data to a data file, and then use the "fit" command to fit a curve.

  7. Thank you so much for your fast answer! I was trying since two hours... Now finally I have a really beautiful graph :) I love gnuplot and your blog!
    Greetings from Lyon, yours Daniel

  8. Hi there,

    First of all, thank you for this blog! I'm trying to make a histogram using the same script that you provided above. The only difference is that the data doesn't seem to be accumulating. Although I have one set of data, it seems to plot 4 different histograms.

    Here is what it looks like
    https://docs.google.com/document/d/1DLor564g7o-wYYB6vg3arAQRt7d2C9M5E7-h-EnQaoQ/edit

    Is there any reason why this is so ?

    The only difference in my script is that I have normalised the distribution by changing

    u (hist($1,width)):(1.0)

    to

    u (hist($1,width)):(1.0/(N*width))


    where N = number of data points

    Help?

    Thanks in advance!

    R

  9. Ray2.0:
    The most likely reason is that there are some blank lines in your data file. Examine your data file, delete those lines, and then try again. Good luck!

  10. Hi there!

    Thanks so much for the reply! You are right. My data file is also 500 000 lines and there were some nans in there. I have another point of query however! Do you know how to plot 3d histograms? I saw an image of this online: a 3d histogram with projections on the different sides of the plot.

    I hope this makes sense!

    thanks in advance =)

    R

  11. Ray2.0:
    A 3-d histogram is usually neither necessary nor recommended.
    For example, this graph (http://www.photobiology.com/v1/maragoni/img13.jpg) is really a bad one: the bars hide each other, so the reader cannot get the information the graph is intended to give. This kind of data is better plotted as a heatmap (http://flowingdata.com/wp-content/uploads/yapb_cache/nba_heatmap_revised.7sjutbstqyw40kw4o08og084k.2xne1totli0w8s8k0o44cs0wc.th.png).
    And a 3-d histogram like this one (http://cqisignals.com/samples/highres-histogram-3D-chart.png) gives nothing more than a normal histogram and only risks misleading the reader (when two values are nearly the same, it is harder in such a plot than in a normal histogram to decide which one is larger).

  12. Hi again,

    regarding this example:
    http://www.photobiology.com/v1/maragoni/img13.jpg

    I did not intend to use 'with boxes' options but linepoints instead. Actually what I have is a list of values for a complex variable, so two columns of real and imaginary values. And I wanted to observe the shape of the distribution function. Furthermore, if I use the kdensity option, perhaps I could get a nice smooth distribution.

    I do agree however that the second type of 3D histogram is pretty useless and has only aesthetic merit.

  13. Ray2.0:
    Plotting a list of complex values is actually not a 3-d plotting problem. It is two 2-d histogram plotting tasks. So ...

  14. Worked beautifully. Thanks a lot.
    I love your handle too because I speak Chinese.

  15. Hi!

    thank you very much!!!! Let me ask one question: how did you generate random numbers in [-4,4]? I'm not supposed to use a library function, only a generator that is provided. I can normalize it to [0, n], but how do I proceed to achieve [-n,n]?

    Thank you so much again!

    1. Provided you can generate a random number x uniformly distributed in [0,1], max*(2*x-1) will be a random number uniformly distributed in the range [-max,max].

  16. Hi!

    Can i use 2 data files and build a stacked histogram with different colors. I have two data files data1.dat and data2.dat. I can make a histogram using ur code with data1.dat. Now on the same plot i want to make the histogram with data2.dat but stacked on top of the first histogram. How can i do it?

    Thanks
    pc

    1. It is difficult to process two files at the same time when you plot using gnuplot. It is advisable to merge the files beforehand. On Linux, the "paste" command can be used to merge files.

  17. Hi,
    I need plot something of this sort http://www.flickr.com/photos/intumyspace/6911907271/
    and need to use gnuplot.py can u suggest how can we vary the histogram width and need to display some info in every slot.
    Currently I just found this, and trying to figure out how to dynamically plot histograms one after the other rather than plotting at once when whole info is available
    http://gnuplot.sourceforge.net/demo/histograms.html
    Thanks for your time

    1. To vary the histogram width, the "boxes" plot style is recommended. You may refer to this post: http://gnuplot-surprising.blogspot.com/2011/09/plot-histograms-using-boxes.html

  18. thanks! this example script has proved incredibly useful

  19. Thanks for your article! Very useful

  20. Good Article About Statistic analysis and histogram plotting using gnuplot

  21. What does "(1.0)" mean in the last line? Can I replace it with a column number?

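    1. "(1.0)" means the constant value 1.0: each data point adds 1 to its interval, so the sum is a count. For counting it cannot be replaced with a column number, because a column in that position (for example 2) would make each interval show the sum of that column's values rather than the number of points.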

  22. Another question, why it is wrong when I use "set logscale xy"?

    1. Are you sure it is an error caused by "set logscale xy"?

    2. Thank you for your reply!
      When I use "set logscale y" the histogram plot becomes flat. I tried another way to plot: first output the count for each column, then plot the histogram. This works all right with logscale.

  23. Thank you very much indeed! It was very useful for me! ;)

  24. Really great, Yesterday I wasted 15 minutes in doing the same with Libre calc. Thanks for the code.. Its awesome!!!

  25. Hi over there. Thanks for your blog. Very useful. However, I slightly modified it for controlling explicitly the number of intervals, etc. For my data set, for the same data limits, when I ask to plot 10 intervals (of 5 units), the subroutine works fine even when I sent to plot relative frequencies. However, when I send to print 5 intervals (of 10 units) I get rather 6 boxes!! do you happen to know why?.

    1. If you can give me your plotting-script and data file, I may figure out the problem.

  26. Wow, it worked in a minute, thanks. Great example.

  27. Hi,
    Thanks for very useful blog!
    could you explain a bit how I can use set table command. I want to fit a density plot to my histogram.
    Thanks a lot!

    1. When you use the command
      set table "outfile-name",
      the plot and splot commands will not actually draw a figure; instead they write the data to a file with the name you specified.

  28. Hi.

    Thanks a lot!

    I use gnuplot 4.4 patchlevel 0(=V1) and gnuplot 4.2 patchlevel 2(=V2)

    When i use your script in V2 - all work pretty.
    In V1 - i get error:
    "all points y value undefined!"
    if i set yrange to [0:100] it's work, but plot is empty - only axes

    Please help me to solve this problem

    Thank you.

    1. It is a strange problem. The script worked well on my computer even when the gnuplot 4.4.0 is used. Maybe you can restart your gnuplot and then run the script again.

  29. Very useful, cheers!

  30. Big sister, you are amazing..

  31. Great post!

    Could you please explain a little about the functions used here? Also, how can I plot the relative frequencies without using any other pre-processing tools?

    1. After the first line add a new line "stats 'data.dat' u 1" (the stats command needs gnuplot 4.6 or later). And modify the last line to "plot "data.dat" u (hist($1,width)):(1.0/STATS_records) smooth freq w boxes lc rgb "green" notitle". Note that the division must be inside the parentheses. Then the relative frequencies are plotted.

  32. Many thanks for this quick tutorial !!

  33. This is very useful, but now I have another problem: I want to make a normal distribution with these data. How can I do this?

  34. Hi all,
    I have used this example and then got this error:
    delay.sh: line 7: syntax error near unexpected token `x,width'
    ./delay.sh: line 7: `hist(x,width)=width*floor(x/width)+width/2.0'

    Any one has faced the same problem or knows to solve it please.

  35. Hi, thanks for this script. Although it gave me a syntax error associated with the line 'set offset graph 0.05,0.05,0.05,0.0', I was able to run it successfully after commenting out this line.

  36. Very useful script - thank you :-) Any ideas how I would set the y upper bound to be dynamic? (i.e. the max value of the largest bin frequency)

    Thanks!


Creative Commons License
Except as otherwise noted, the content of this page is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.