The Python Oracle

How to create a DataFrame of random integers with Pandas?

--------------------------------------------------
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------

Music by Eric Matyas
https://www.soundimage.org
Track title: Luau

--

Chapters
00:00 How To Create A Dataframe Of Random Integers With Pandas?
00:29 Accepted Answer Score 291
01:10 Answer 2 Score 24
01:36 Answer 3 Score 5
02:33 Thank you

--

Full question
https://stackoverflow.com/questions/3275...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #pandas #dataframe #size #shapes

#avk47



ACCEPTED ANSWER

Score 292


numpy.random.randint accepts a third argument (size) , in which you can specify the size of the output array. You can use this to create your DataFrame -

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

Here - np.random.randint(0,100,size=(100, 4)) - creates an output array of size (100,4) with random integer elements between [0,100) .


Demo -

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

which produces:

     A   B   C   D
0   45  88  44  92
1   62  34   2  86
2   85  65  11  31
3   74  43  42  56
4   90  38  34  93
5    0  94  45  10
6   58  23  23  60
..  ..  ..  ..  ..



ANSWER 2

Score 25


The recommended way to create random integers with NumPy these days is to use numpy.random.Generator.integers. (documentation)

import numpy as np
import pandas as pd

rng = np.random.default_rng()
df = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list('ABCD'))
df
----------------------
      A    B    C    D
 0   58   96   82   24
 1   21    3   35   36
 2   67   79   22   78
 3   81   65   77   94
 4   73    6   70   96
... ...  ...  ...  ...
95   76   32   28   51
96   33   68   54   77
97   76   43   57   43
98   34   64   12   57
99   81   77   32   50
100 rows × 4 columns



ANSWER 3

Score 6


You can also use np.random.Generator.choice.

df = pd.DataFrame(np.random.default_rng().choice(100, size=(100, 4)), columns=['A','B','C','D'])

The advantage of this method over integers is that you can choose from any list / array you want. For example, if you want to generate random sample from [2, 5, 10], then

df = pd.DataFrame(np.random.default_rng().choice([2,5,10], size=(100, 4)), columns=['A','B','C','D'])

You can even associate a probability distribution to sample entries. For example, if you want to choose 2 with p=0.8, and 5 with p=0.2, you can do so by, passing p= argument.

df = pd.DataFrame(np.random.default_rng().choice([2,5], p=[.8,.2], size=(100, 4)), columns=['A','B','C','D'])

Also, with the Generator, choice is as fast as integers and faster than randint.

%timeit pd.DataFrame(np.random.default_rng().choice(100, size=(100_000,4)), columns=[*'ABCD'])
# 3.34 ms ± 308 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit pd.DataFrame(np.random.default_rng().integers(0, 100, size=(100_000,4)), columns=[*'ABCD'])
# 3.81 ms ± 708 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit pd.DataFrame(np.random.randint(100, size=(100_000,4)), columns=[*'ABCD'])
# 6.78 ms ± 776 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)