## Sunday, September 11, 2011

### Statistic analysis and histogram plotting using gnuplot

Given a data file containing a set of data, count how many data points fall in each of the intervals [a1:a2], [a2:a3], ..., then plot the result as a histogram. This is a common problem in statistics and exactly what we will do in this article.

First, let us see how to map the data into intervals. The function "floor(x)" returns the largest integer not greater than its argument, so floor(x/dx)*dx maps x to the lower edge of the interval of width dx that contains it, that is, to one of the intervals [-n*dx:-(n-1)*dx], [-(n-1)*dx:-(n-2)*dx], ..., [(n-1)*dx:n*dx].
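
As a quick check of this mapping, the following minimal sketch can be pasted into an interactive gnuplot session. The interval width dx = 0.5 and the function name bin_left are illustrative choices only, not part of the plotting script.

```gnuplot
# Illustrative values: dx and bin_left are hypothetical names for this sketch
dx = 0.5
bin_left(x) = floor(x/dx)*dx

print bin_left(0.7)    # 0.7 lies in [0.5:1.0], so the lower edge is 0.5
print bin_left(-0.2)   # -0.2 lies in [-0.5:0.0], so the lower edge is -0.5
```

Note that floor rounds toward minus infinity, which is why negative values also land on the lower edge of their interval.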

Next we count the number of data points in each interval. gnuplot has a smooth option called "frequency". It makes the data monotonic in x, and points with the same x-value are replaced by a single point having the summed y-values. So if every data point is mapped to its interval and given the y-value 1, the summed y-value of each interval is the number of points in it.
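
The summing behaviour can be seen on a tiny data set. This is a sketch that assumes gnuplot 5.0 or later (for the inline data block); the names $data and counts.dat are made up for the example.

```gnuplot
# Five sample values; 1 appears three times, 2 and 3 once each
$data << EOD
1
1
2
1
3
EOD

# Write the smoothed points to a file instead of drawing them
set table "counts.dat"
plot $data using 1:(1.0) smooth frequency
unset table
# counts.dat now holds one point per distinct x-value,
# with y = 3 for x = 1 and y = 1 for x = 2 and x = 3
```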

Finally, we plot the result using the boxes plot style.

The main idea has now been introduced. It is time to write the plotting script.
```gnuplot
reset
n=100 #number of intervals
max=3. #max value
min=-3. #min value
width=(max-min)/n #interval width
#function used to map a value to the intervals
hist(x,width)=width*floor(x/width)+width/2.0
set term png #output terminal and file
set output "histogram.png"
set xrange [min:max]
set yrange [0:]
#to put an empty boundary around the
#data inside an autoscaled graph.
set offset graph 0.05,0.05,0.05,0.0
set xtics min,(max-min)/5,max
set boxwidth width*0.9
set style fill solid 0.5 #fillstyle
set tics out nomirror
set xlabel "x"
set ylabel "Frequency"
#count and plot
plot "data.dat" u (hist($1,width)):(1.0) smooth freq w boxes lc rgb"green" notitle
```
We use a data file (download from here) that contains 10000 normally distributed random numbers and get a graph like the following one.

[Figure: statistic histogram plotting using gnuplot]

1. you are my hero!

2. hi, i tried the same thing using gnuplot but it says "undefined variable: graph"

then i still continue with the plot and says "all points y value undefined"

thanks.

1. "all points y value undefined" means that all your y points are out of your yrange, you have to set it in order to have them in it...

2. I just had the same mistake. I forgot to set the datafile delimiter.

3. Hi, Callisto:
1. The script runs well on my computer; I have just confirmed it. So the first problem may be caused by a typing mistake.
2. "all points y value undefined" may be caused by gnuplot not being able to find the data file. Have you put the file data.dat in the working directory?

You may copy the script to a file (for example, plot.gplt), and then copy it and the data file (data.dat) to your working directory. After these are done, run command "load 'plot.gplt'" using gnuplot.

4. i managed to figure it out, just had to remove the word "graph". :)

Would you be able to tell me how to fit a gaussian curve onto the histogram? thank you.

5. Hi, Callisto:
It is a bit hard to fit a Gaussian curve in this problem using only gnuplot, since gnuplot is designed as a plotting tool, not as data processing software. With some tricks the goal may be achieved. Maybe I will talk about how to do it in a future post.
For now I advise using data processing software to process the data first, getting the fitted curve, and then plotting it on the graph.

1. I'm surprised that you can create so many beautiful plots with Gnuplot using a lot of features, but you do not know the "fit" command.
I see that this comment is quite old and most probably (if you looked after) you found already that fitting in Gnuplot is actually very simple.
It is worth a try.

6. Really cool thing! I never thought that gnuplot could do something like that and it's exactly what I wanted to do. Just a little question is it possible to fit a function (in this case a gaussian) to this histogram?
In any case thanks a lot!

7. Anonymous:
It is possible to use "set table " to export the data to a data file. And then use "fit" command to fit a curve.
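
A minimal sketch of the "set table" then "fit" approach described in this reply. It assumes the data file and bin settings from the article; the file name hist.dat, the model name gauss, the parameter names a, mu, sigma and their starting values are all hypothetical choices for illustration.

```gnuplot
# Bin settings as in the article's script
n = 100; max = 3.; min = -3.
width = (max-min)/n
hist(x,width) = width*floor(x/width) + width/2.0

# Step 1: write the binned counts to a file instead of drawing them
set table "hist.dat"
plot "data.dat" using (hist($1,width)):(1.0) smooth freq
unset table

# Step 2: fit a Gaussian to the binned counts
# (a, mu, sigma are illustrative names; the start values are rough guesses)
gauss(x) = a*exp(-(x-mu)**2/(2*sigma**2))
a = 100; mu = 0; sigma = 1
fit gauss(x) "hist.dat" using 1:2 via a, mu, sigma

# Step 3: draw the histogram with the fitted curve on top
set boxwidth width*0.9
set style fill solid 0.5
plot "hist.dat" using 1:2 with boxes notitle, gauss(x) with lines notitle
```

The fit works on the bin centers and counts, so the result depends on the chosen interval width; the starting values may need adjusting for other data.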

8. Thank you so much for your fast answer! I was trying since two hours... Now finally I have a really beautiful graph :) I love gnuplot and your blog!
Greetings from Lyon, yours Daniel

9. Brilliant!

10. Hi there,

First of all, thank you for this blog! I'm trying to make a histogram using the same script that you provided above. the only difference is that the data doesnt seem to be accumulating. Although I have one set of data, it seems to plot 4 different histograms.

Here is what it looks like

Is there any reason why this is so ?

The only difference in my script is that I have normalised the distribution by changing

u (hist($1,width)):(1.0)

to

u (hist($1,width)):(1.0/(N*width))

where N = number of data points

Help?

R

11. Ray2.0:
The most likely reason is that there are some blank lines in your data file. Examine your data file, delete those lines, and then try again. I hope you succeed!

12. Hi there!

Thanks so much for the reply! You are right. My data file is also 500 000 lines and there were some nans in there. I have another point of query however! Do you know how to plot 3d histograms? I saw an image of this online: a 3d histogram with projections on the different sides of the plot.

I hope this makes sense!

R

13. Ray2.0:
A 3-d histogram is usually unnecessary and not recommended.
For example, this graph (http://www.photobiology.com/v1/maragoni/img13.jpg) is really a bad one, since the bars hide each other, so the reader cannot get the information the graph is intended to give. This kind of graph is better plotted as a heatmap (http://flowingdata.com/wp-content/uploads/yapb_cache/nba_heatmap_revised.7sjutbstqyw40kw4o08og084k.2xne1totli0w8s8k0o44cs0wc.th.png).
And a 3-d histogram like this one (http://cqisignals.com/samples/highres-histogram-3D-chart.png) gives nothing more than a normal histogram and only brings a risk of misleading the reader (when two values are nearly the same, it is harder in such a plot than in a normal histogram to decide which one is larger).

14. Hi again,

regarding this example:
http://www.photobiology.com/v1/maragoni/img13.jpg

I did not intend to use 'with boxes' options but linepoints instead. Actually what I have is a list of values for a complex variable, so two columns of real and imaginary values. And I wanted to observe the shape of the distribution function. Furthermore, if I use the kdensity option, perhaps I could get a nice smooth distribution.

I do agree however that the second type of 3D histogram is pretty useless and has only aesthetic merit.

15. Ray2.0:
Plotting a list of complex values is actually not a 3-d plotting problem. It is two 2-d histogram plotting tasks. So ...

16. Worked beautifully. Thanks a lot.
I love your handle too because I speak Chinese.

17. Hi!

thank you very much!!!! Let me ask one question: how did you generate random numbers between [-4,4]? I'm not supposed to use a library function, but a generator that is provided. I can normalize it to [0, n], but how do I proceed to achieve [-n,n]?

Thank you so much again!

1. Provided you can now generate a random number x uniformly distributed in [0,1], max*(2*x-1) will be a random number uniformly distributed in the range [-max,max].

18. Hi!

Can i use 2 data files and build a stacked histogram with different colors. I have two data files data1.dat and data2.dat. I can make a histogram using ur code with data1.dat. Now on the same plot i want to make the histogram with data2.dat but stacked on top of the first histogram. How can i do it?

Thanks
pc

1. It is always very difficult to process two files at the same time when you plot using gnuplot. It is advisable to merge the files beforehand. On Linux, the command "paste" can be used to merge files.

19. Hi,
I need plot something of this sort http://www.flickr.com/photos/intumyspace/6911907271/
and need to use gnuplot.py can u suggest how can we vary the histogram width and need to display some info in every slot.
Currently I just found this, and trying to figure out how to dynamically plot histograms one after the other rather than plotting at once when whole info is available
http://gnuplot.sourceforge.net/demo/histograms.html

1. To vary the histogram width, the "boxes" plot style is recommended. You may refer to this post: http://gnuplot-surprising.blogspot.com/2011/09/plot-histograms-using-boxes.html

20. thanks! this example script has proved incredibly useful

21. Thanks for your article! Very useful

22. Good Article About Statistic analysis and histogram plotting using gnuplot

23. What does "(1.0)" mean in the last line? Can I replace it with a column number?

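1. "(1.0)" means the constant value 1.0: every data point contributes 1 to its interval, so the summed y-value is a count. Replacing it with a column number would sum that column's values in each interval instead of counting the points.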

24. Another question, why it is wrong when I use "set logscale xy"?

1. Are you sure, it is an error caused by "set logscale xy"?

When I use "set logscale y" the histogram plot becomes flat. I tried another way to plot: first output the number in each column, then plot the histogram. This works all right with logscale.

25. Thank you very much indeed! It was very useful for me! ;)

26. Really great, Yesterday I wasted 15 minutes in doing the same with Libre calc. Thanks for the code.. Its awesome!!!

27. Hi over there. Thanks for your blog. Very useful. However, I slightly modified it for controlling explicitly the number of intervals, etc. For my data set, for the same data limits, when I ask to plot 10 intervals (of 5 units), the subroutine works fine even when I sent to plot relative frequencies. However, when I send to print 5 intervals (of 10 units) I get rather 6 boxes!! do you happen to know why?.

1. If you can give me your plotting-script and data file, I may figure out the problem.

28. thanks a lot!

29. Awesomeness!

30. Wow, it worked in a minute, thanks. Great example.

31. Thank you :)

32. Hi,
Thanks for very useful blog!
could you explain a bit how I can use set table command. I want to fit a density plot to my histogram.
Thanks a lot!

1. When one uses the command
set table "outfile-name",
the plot and splot commands will not actually draw a figure; instead they will write out a data file with the name you specified.

33. Hi.

Thanks a lot!

I use gnuplot 4.4 patchlevel 0(=V1) and gnuplot 4.2 patchlevel 2(=V2)

When i use your script in V2 - all work pretty.
In V1 - i get error:
"all points y value undefined!"
if i set yrange to [0:100] it's work, but plot is empty - only axes

Thank you.

1. It is a strange problem. The script worked well on my computer even when the gnuplot 4.4.0 is used. Maybe you can restart your gnuplot and then run the script again.

34. Very useful, cheers!

35. Sister, you are amazing..

36. Great post!

Could you please elaborate a little on the functions used here? Also, how can I plot the relative frequencies without using any other pre-processing tools?

1. After the first line add a new line "stats 'data.dat' u 1". And modify the last line to "plot "data.dat" u (hist($1,width)):(1.0/STATS_records) smooth freq w boxes lc rgb"green" notitle". Then the relative frequencies are plotted.
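
Put together, the suggestion in this reply might look like the sketch below. It assumes a gnuplot version that has the "stats" command (4.6 or later) and the same data.dat and bin settings as the article; the "nooutput" option only suppresses the printed summary.

```gnuplot
reset
# Count the records first; this defines STATS_records
stats "data.dat" using 1 nooutput

n = 100; max = 3.; min = -3.
width = (max-min)/n
hist(x,width) = width*floor(x/width) + width/2.0

set xrange [min:max]
set yrange [0:]
set boxwidth width*0.9
set style fill solid 0.5
set xlabel "x"
set ylabel "Relative frequency"

# Each point contributes 1/STATS_records, so the box heights sum to 1
plot "data.dat" using (hist($1,width)):(1.0/STATS_records) smooth freq with boxes lc rgb "green" notitle
```

The division sits inside the parentheses so that the whole expression is evaluated for every data point.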

37. Many thanks for this quick tutorial !!

38. This is very useful, but I now have another problem: I want to make a normal distribution with this data. How can I do this?

39. Just great!. Thanks so much!

40. Hi all,
I have used this example and then got this error:
delay.sh: line 7: syntax error near unexpected token `x,width'
./delay.sh: line 7: `hist(x,width)=width*floor(x/width)+width/2.0'

Any one has faced the same problem or knows to solve it please.

41. Hi, thanks for this script. Although it gave me a syntax error, associated with the line 'set offset graph 0.05,0.05,0.05,0.0', I was able to run it successfully after commenting out this line.

42. Very useful script - thank you :-) Any ideas how I would set the y upper bound to be dynamic? (i.e. the max value of the largest bin frequency)

Thanks!

1. It should already be set to be dynamic, and you can always try to leave the yrange line out and see if the result looks good.

43. Thanks, still very useful!

I also encountered the following error.

"all points y value undefined"

This occurred because I used "min=5" instead of "min=5."
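
A likely explanation is gnuplot's integer division: when both operands are integers the result is truncated. The sketch below uses hypothetical values (the comment does not give max or n) to show the effect on the interval width.

```gnuplot
# Hypothetical values, written without decimal points
n = 100
max = 10
min = 5
print (max-min)/n    # integer division: prints 0

# With a trailing dot the value is real and the division is exact
min = 5.
print (max-min)/n    # real division: prints 0.05
```

A width of 0 makes x/width undefined for every data point, which would be consistent with the "all points y value undefined" message.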

44. That piece of code was extremely helpful.
Thank you!

45. Shouldn't there be:

hist(x,width)=width*floor((x-min)/width)+width/2.0+min

instead of

hist(x,width)=width*floor(x/width)+width/2.0 ?

For case:
x=10; min=1; max=101; n=10 (width=10)

x should map into interval [1:11]
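
The two mappings can be compared directly for the case given in this comment. This is only a sketch; the name hist_shift is made up here to keep both versions side by side.

```gnuplot
# Values from the comment: x=10, min=1, max=101, n=10
min = 1.; max = 101.; n = 10
width = (max-min)/n     # 10.0

# Mapping used in the article: interval edges at multiples of width
hist(x,width) = width*floor(x/width) + width/2.0
# Mapping proposed in the comment: interval edges start at min
hist_shift(x,width) = width*floor((x-min)/width) + width/2.0 + min

print hist(10, width)         # bin center 15, i.e. the interval [10:20]
print hist_shift(10, width)   # bin center 6, i.e. the interval [1:11]
```

In the article's script min = -3 is a whole multiple of width = 0.06, so there the two mappings should give the same intervals; the shifted form matters when min is not a multiple of the width.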

46. When I ran the script it gave the following error :

plot "data.dat" u (hist(,width)):(1.0) smooth freq w boxes lc rgb"green" notitle
^
line 0: invalid expression

Can anyone help me with this error ?
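
The error message shows hist(,width) where the script has a column reference as the first argument, which suggests the "$1" was consumed before gnuplot saw it, for example by a shell when the script is embedded in a shell script or here-document (an assumption; the comment does not say how the script was run). One way to sidestep this is to write column(1), which gnuplot treats the same way inside a "using" expression:

```gnuplot
n = 100
max = 3.
min = -3.
width = (max-min)/n
hist(x,width) = width*floor(x/width) + width/2.0

set boxwidth width*0.9
set style fill solid 0.5

# column(1) is the spelled-out form of the first-column reference,
# and contains no dollar sign for a shell to expand
plot "data.dat" using (hist(column(1),width)):(1.0) smooth freq with boxes lc rgb "green" notitle
```

Saving the commands in a file such as plot.gplt and running them with load 'plot.gplt' from inside gnuplot, as suggested in an earlier reply, avoids the shell altogether.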