More array creation functions
Here I will detail more of the functions responsible for creating arrays. We already saw np.zeros(), np.ones() and np.empty(). There is also np.full(), which creates an array but, instead of filling it with zeros or ones, fills it with a number you specify in the second argument.
np.eye() and np.identity() both create an identity matrix; np.eye() accepts more arguments than np.identity().
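As a quick illustration (the shapes and the fill value here are arbitrary examples):

```python
import numpy as np

# np.full: like np.zeros/np.ones, but with a fill value of your choosing
a = np.full((2, 3), 7)
print(a)
# [[7 7 7]
#  [7 7 7]]

# np.identity builds a plain square identity matrix
print(np.identity(3))

# np.eye also accepts a column count and a diagonal offset k
print(np.eye(3, 4, k=1))  # 3x4, with ones on the first superdiagonal
```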
np.indices() is useful in the following scenario: suppose you have an array sample and a function such as np.sin that can take several arguments, all with the same shape (or you could stack all those arguments into an array with one extra dimension, which is what np.indices() returns). Instead of passing completely unrelated arrays as arguments, you want to pass arrays derived from the same sample, with rows as the first dimension in argument 1, columns as the first dimension in argument 2, and so on, so that the first dimensions of the resulting arrays come from different axes of the sample array. You can index the sample array with the index array generated by np.indices() to get this result.
A practical use for this function is making a 3D plot of a function evaluated on a 3D mesh: each element has three coordinates, and it's useful to run the function over the X, Y and Z axes all at once.
The argument of np.indices() is the shape of the array, as a tuple or list.
In [1]: np.indices((3,3))
Out[1]:
array([[[0, 0, 0],
        [1, 1, 1],
        [2, 2, 2]],

       [[0, 1, 2],
        [0, 1, 2],
        [0, 1, 2]]])
In [2]: sample = np.linspace(1., 4., 9).reshape(3,3)
In [3]: sample
Out[3]:
array([[1.   , 1.375, 1.75 ],
       [2.125, 2.5  , 2.875],
       [3.25 , 3.625, 4.   ]])
In [4]: sample[np.indices((3,3))]
Out[4]:
array([[[[1.   , 1.375, 1.75 ],
         [1.   , 1.375, 1.75 ],
         [1.   , 1.375, 1.75 ]],

        [[2.125, 2.5  , 2.875],
         [2.125, 2.5  , 2.875],
         [2.125, 2.5  , 2.875]],

        [[3.25 , 3.625, 4.   ],
         [3.25 , 3.625, 4.   ],
         [3.25 , 3.625, 4.   ]]],


       [[[1.   , 1.375, 1.75 ],
         [2.125, 2.5  , 2.875],
         [3.25 , 3.625, 4.   ]],

        [[1.   , 1.375, 1.75 ],
         [2.125, 2.5  , 2.875],
         [3.25 , 3.625, 4.   ]],

        [[1.   , 1.375, 1.75 ],
         [2.125, 2.5  , 2.875],
         [3.25 , 3.625, 4.   ]]]])
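As a minimal sketch of that idea, here is a function of the grid coordinates evaluated on a 3x3 mesh using np.indices (the function sin(row) + cos(col) is just an arbitrary example):

```python
import numpy as np

# np.indices returns one coordinate array per axis,
# stacked along a new first dimension
rows, cols = np.indices((3, 3))

# rows[i, j] == i and cols[i, j] == j, so any function of the
# grid coordinates can be evaluated in one vectorized call
z = np.sin(rows) + np.cos(cols)
print(z.shape)  # (3, 3)
```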
Reading arrays from file
Very large arrays are usually stored in files because they would be too big and tedious to write out in code. Plain text formats such as CSV can be read directly with numpy's own functions, but formats like HDF5, FITS and image files require third-party libraries to parse the file and build a numpy array from the result. I hope to cover those in due time.
In particular, h5py reads HDF5 files, Astropy reads FITS files and Pillow reads image files.
Reading arrays from a string
np.genfromtxt() enables us to read an array from a simple string (or file) containing data in a tabular format.
In [1]: from io import StringIO
In [2]: data = u"1, 2, 3\n4, 5, 6"
...: np.genfromtxt(StringIO(data), delimiter=",")
Out[2]:
array([[1., 2., 3.],
       [4., 5., 6.]])
In [3]: print(data)
1, 2, 3
4, 5, 6
The data string looks exactly like the numpy array that was created from it. Also, notice that you can set the delimiter keyword argument to whatever delimiter your data uses. By default, the delimiter is any run of whitespace.
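For example, assuming the columns are separated by varying runs of spaces, the delimiter argument can simply be omitted:

```python
import numpy as np
from io import StringIO

# Columns separated by varying amounts of whitespace
data = u"1  2   3\n4 5  6"

# No delimiter argument: any run of whitespace splits the fields
arr = np.genfromtxt(StringIO(data))
print(arr)
# [[1. 2. 3.]
#  [4. 5. 6.]]
```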
You can also remove leading and trailing whitespace with the autostrip keyword argument:
>>> data = u"1, abc , 2\n 3, xxx, 4"
>>> # Without autostrip
>>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5")
array([['1', ' abc ', ' 2'],
       ['3', ' xxx', ' 4']], dtype='<U5')
>>> # With autostrip
>>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5", autostrip=True)
array([['1', 'abc', '2'],
       ['3', 'xxx', '4']], dtype='<U5')
There is also a comments argument: everything from the comment character (such as #) to the end of the line is discarded. If this is None, no lines are treated as comments. The default is #, so all hash comments are removed.
>>> data = u"""#
... # Skip me !
... # Skip me too !
... 1, 2
... 3, 4
... 5, 6 #This is the third line of the data
... 7, 8
... # And here comes the last line
... 9, 0
... """
>>> np.genfromtxt(StringIO(data), comments="#", delimiter=",")
array([[1., 2.],
       [3., 4.],
       [5., 6.],
       [7., 8.],
       [9., 0.]])
The skip_header and skip_footer arguments exclude a certain number of lines at the beginning or end, respectively. Both default to 0.
In [1]: data = u"\n".join(str(i) for i in range(10))
In [2]: print(data)
0
1
2
3
4
5
6
7
8
9
In [3]: np.genfromtxt(StringIO(data),
...: skip_header=3, skip_footer=5)
Out[3]: array([3., 4.])
usecols allows us to select particular columns to import into the array. Indices are specified as a tuple and behave like normal Python list indices. In particular, negative indices work the same way as in Python, so -1 means the last column, -2 the second to last, etc. You can't select a column more than once.
This can be combined with the names argument, which gives names to all the columns, to select columns by name. Names and indices can both be mixed in usecols, but in any case a column must not be selected more than once.
>>> data = u"1 2 3\n4 5 6"
>>> np.genfromtxt(StringIO(data), usecols=(0, -1))
array([[ 1., 3.],
       [ 4., 6.]])
>>> np.genfromtxt(StringIO(data),
... names="a, b, c", usecols=("a", "c"))
array([(1.0, 3.0), (4.0, 6.0)],
      dtype=[('a', '<f8'), ('c', '<f8')])
>>> np.genfromtxt(StringIO(data),
... names="a, b, c", usecols=("a", 2)) # Same as above
array([(1.0, 3.0), (4.0, 6.0)],
      dtype=[('a', '<f8'), ('c', '<f8')])
>>> np.genfromtxt(StringIO(data),
... names="a, b, c", usecols=("a, c")) # Notice they're all in one string
array([(1.0, 3.0), (4.0, 6.0)],
      dtype=[('a', '<f8'), ('c', '<f8')])
>>> np.genfromtxt(StringIO(data),
... names="a, b, c", usecols=("a, 2")) # Fails, 2 is not the name of a column
ValueError: '2' is not in list
>>> np.genfromtxt(StringIO(data),
... names="a, b, c", usecols=("a", 0)) # Fails, column 0 named "a" is imported more than once
ValueError: field 'a' occurs more than once
In this function, dtype is allowed to be None, which causes np.genfromtxt to guess the dtype of each column. But this is a lot slower than specifying the dtype explicitly.
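For example (the column values here are arbitrary; encoding is passed explicitly so string fields come back as str):

```python
import numpy as np
from io import StringIO

data = u"1, 2.5, abc\n4, 7.5, def"

# dtype=None: genfromtxt guesses a dtype per column, yielding a
# structured array with mixed types instead of a plain 2-D array
arr = np.genfromtxt(StringIO(data), delimiter=",", dtype=None,
                    encoding="utf-8")
print(arr.dtype.names)  # ('f0', 'f1', 'f2')
print(arr['f0'])        # the first column was guessed as integer
print(arr['f1'])        # the second as float
```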
deletechars deletes all the characters provided as a string to the deletechars argument from field names (not elements). By default, the deleted characters are ~!@#$%^&*()-=+~\|]}[{';: /?.>,<. Again, these are deleted only from field names; elements keep them, given an appropriate dtype:
In [1]: data = u"(, 0, )\n<, ., d>2d"
...: np.genfromtxt(StringIO(data), usecols=(1, 2), dtype='U') # Note: delimiter is whitespace
Out[1]:
array([['0,', ')'],
       ['.,', 'd>2d']], dtype='<U4')
excludelist prepends an underscore to a field's name if it matches one of the values in this list. The names you pass are appended to a default list of reserved names (return, file and print), which always get the underscore. This only applies to field names, not elements.
In [1]: data = u"(, 0, )\n<, ., return"
...: np.genfromtxt(StringIO(data), usecols=(1, 2), dtype='U', excludelist=[
...: 'return'])
Out[1]:
array([['0,', ')'],
       ['.,', 'return']], dtype='<U6')
The case_sensitive argument determines the case of the field names: uppercase (case_sensitive=False or case_sensitive='upper'), lowercase (case_sensitive='lower'), or left alone (case_sensitive=True, the default value).
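A short example of the name-case behaviour (the header names here are made up):

```python
import numpy as np
from io import StringIO

data = u"Alpha Beta Gamma\n1 2 3"

# names=True reads the field names from the first line
lower = np.genfromtxt(StringIO(data), names=True, case_sensitive="lower")
upper = np.genfromtxt(StringIO(data), names=True, case_sensitive=False)

print(lower.dtype.names)  # ('alpha', 'beta', 'gamma')
print(upper.dtype.names)  # ('ALPHA', 'BETA', 'GAMMA')
```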
Adjusting values during import
Suppose you have a date field or a percentage in your tabular data. Numpy can't convert them by itself, so what do you do? Luckily, genfromtxt has another argument called converters: a dictionary whose keys are column names or indices and whose values are functions (def or lambda) that take an element of the column in string form as their only argument (so the argument type is str) and return a value matching the column's dtype.
# This converter turns percentages to floats
>>> convertfunc = lambda x: float(x.strip(b"%"))/100.
>>> data = u"1, 2.3%, 45.\n6, 78.9%, 0"
>>> names = ("i", "p", "n")
>>> np.genfromtxt(StringIO(data), delimiter=",", names=names,
... converters={1: convertfunc}) # Convert second column
array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)],
dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])
Also, something like convert = lambda x: float(x.strip() or -999), or anything with <result-evaluation> or <default-value> in it, could be used to provide a default value in case an invalid element is passed to the converter. Converters should never assume that the input data is well formed.
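Here is one way such a defensive converter might look (the -999 sentinel is an arbitrary choice; encoding is passed so the converter receives str rather than bytes):

```python
import numpy as np
from io import StringIO

# Falls back to -999 whenever the field is empty or blank
convert = lambda x: float(x.strip() or -999)

data = u"1, , 3\n4, 5, 6"
arr = np.genfromtxt(StringIO(data), delimiter=",",
                    converters={1: convert}, encoding="utf-8")
print(arr)  # the blank field in row 0, column 1 becomes -999.0
```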
The missing_values argument offers a cleaner solution to this. It takes a list of strings, one per column; elements of a column that exactly match that column's string are considered missing. missing_values can also take a dictionary mapping column indices/names (or None, to represent all columns) to missing-value strings instead of a list. It can even take a single string that marks missing elements across the entire table. By default, missing_values=None, so the only value treated as missing is the empty string.
Now, to fill those missing values with default values, we must set the filling_values argument. It has exactly the same format as missing_values, but with default values of the column's type instead of missing strings. Everything said above about missing_values applies here too.
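The example below (adapted from the numpy manual) combines both arguments, mixing indices and names as dictionary keys:

```python
import numpy as np
from io import StringIO

data = u"N/A, 2, 3\n4, ,???"
kwargs = dict(delimiter=",",
              dtype=int,
              names="a,b,c",
              missing_values={0: "N/A", 'b': " ", 2: "???"},
              filling_values={0: 0, 'b': 0, 2: -999})
arr = np.genfromtxt(StringIO(data), **kwargs)
print(arr)
# [(0, 2, 3) (4, 0, -999)]
```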
In [1]: data = u"2, 2, 3\n1, , 3"
...: np.genfromtxt(StringIO(data), dtype='i', delimiter=",")
Out[1]:
array([[ 2,  2,  3],
       [ 1, -1,  3]], dtype=int32)
These are the default filling values for each of the basic dtypes:

| Type    | Default Value |
|---------|---------------|
| bool    | False         |
| int     | -1            |
| float   | np.nan        |
| complex | np.nan + 0j   |
Finally, the usemask argument allows us to inspect which of the elements were labeled missing if you set it to True. By default it's False:
In [1]: data = u"2, 2, 3\n1, , 3"
...: np.genfromtxt(StringIO(data), dtype='i', delimiter=",", usemask=True)
Out[1]:
masked_array(
  data=[[2, 2, 3],
        [1, --, 3]],
  mask=[[False, False, False],
        [False, True, False]],
  fill_value=999999,
  dtype=int32)
This masked_array is a Python object, and the fields listed here can be explored with dot notation or getattr().
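For instance, continuing the example above:

```python
import numpy as np
from io import StringIO

data = u"2, 2, 3\n1, , 3"
arr = np.genfromtxt(StringIO(data), dtype='i', delimiter=",", usemask=True)

# mask is True exactly where the value was missing
print(arr.mask[1, 1])             # True
print(getattr(arr, "fill_value")) # 999999, the default for integer dtypes

# filled() replaces the masked entries with a value of your choice
print(arr.filled(-1))
```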
And we're done
Much material was sourced from the numpy manual.
If you see anything incorrect here please let me know so I can fix it.