[P] Modifying open-sourced matrix multiplication kernel

I've spent the past few months optimizing my matrix multiplication CUDA kernel, and finally got near-cuBLAS performance on a Tesla T4. In the past few weeks I've been trying to fuse all kinds of operations into the matmul kernel, such as reductions, top-k search, and masked_fill, and the results are looking pretty good. All of the fused kernels are much faster than the separate versions while using much less memory.

Runtime of fused MinBMM vs. torch.bmm + torch.min

edit: unit of time in this plot should be seconds, not milliseconds

Runtime of fused TopkBMM vs. torch.bmm + torch.topk

Runtime of fused MBMM vs. torch.bmm + torch.masked_fill

I also wrote a blog post about the motivation, applications and some implementation details of these kernels. The source code can be found in this repo.
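For context, the unfused baseline that the MinBMM plot compares against looks roughly like this (a sketch with assumed shapes, not code from the repo):

    import torch

    # Assumed illustrative shapes: (B, M, K) x (B, K, N) batched matmul
    a = torch.randn(64, 256, 128)
    b = torch.randn(64, 128, 256)

    # Unfused baseline: materialize the full (B, M, N) product, then reduce it
    prod = torch.bmm(a, b)
    values, indices = torch.min(prod, dim=-1)

    # A fused MinBMM kernel produces `values`/`indices` directly, without ever
    # writing the (B, M, N) intermediate out to global memory.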

πŸ‘︎ 192
πŸ’¬︎
πŸ‘€︎ u/DeMorrr
πŸ“…︎ May 27 2021
🚨︎ report
Matrix multiplication conditions

I've been wondering about this for a couple of years now. Why must the number of columns of the first matrix equal the number of rows of the second? I've been looking for an explanation for this but haven't found one. Does anyone know, or can anyone explain it to me?
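One way to see it: each entry of the product is defined as a sum over paired entries,

    (AB)_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj},

so row i of A (whose length is the number of columns of A) is paired term by term with column j of B (whose length is the number of rows of B). The sum only makes sense when those two lengths are equal, which is exactly the condition you are asking about.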

πŸ‘︎ 4
πŸ’¬︎
πŸ“…︎ Jul 07 2021
🚨︎ report
[Matrix Multiplication] Why do the columns of the first matrix need to equal the rows on the second matrix?

First of all, I would like to say that I'm in no way good at math. I try to learn concepts when I need them for something else so I might be missing some prerequisite knowledge.

I'm currently learning about matrix multiplication and I just don't get why you need to transpose B if A is a matrix with shape (p, m) and B is a matrix with shape (n, m).

To me it seems that if the length of the rows (the number of columns) is the same in both matrices, the multiplication can be done row by row without having to transpose matrix B first.

I tried multiplying two equal matrices with shape (3, 4) without transposing the second and came up with the same result as doing it with the transposition.

Is it just a convention to do it the right way or am I missing something in the bigger picture?
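For what it's worth, the row-for-row computation is itself a well-defined product, just of a different pair of matrices: dotting row i of A with row j of B gives

    \sum_{k} a_{ik} b_{jk} = (AB^{T})_{ij},

so it silently computes A times B-transposed. With A and B both of shape (3, 4), that is the only product that exists, which is why it matched your transposed computation; it is not an alternative convention for AB.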

πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/Destroian
πŸ“…︎ Jun 05 2021
🚨︎ report
Matrix multiplication interpretation v.redd.it/s4qem9p77wy61
πŸ‘︎ 37
πŸ’¬︎
πŸ‘€︎ u/TheCloudTamer
πŸ“…︎ May 13 2021
🚨︎ report
ELI5: Why is matrix multiplication important in linear algebra and why is it needed?

I understand basic algebra and how matrix multiplication is done but why is it needed? I can't visualize it or come up with practical examples.

πŸ‘︎ 2
πŸ’¬︎
πŸ‘€︎ u/PhysicsNinja10
πŸ“…︎ Jun 11 2021
🚨︎ report
ArrayFire Matrix Multiplication Vectorization

Hello guys, I am using the ArrayFire library for signal processing. I am just curious about how to make my code more efficient. I read the vectorization guide in the docs, but I just ended up using the gfor construct. Is it possible to improve efficiency? Is there a better way to vectorize this? (I hope there is :)

Note: I am aiming for CUDA performance.

Here is the code which I am trying to improve:

#include <arrayfire.h>
#include <stdio.h>
#include <af/util.h>

static int proc_size = 1024;
static int fft_size  = proc_size * 4;
static int staves    = 288;
static int beams     = 256;

static af::array S;
static af::array B;
static af::array R;

void fn()
{
    gfor ( af::seq i, fft_size )
       R( i , af::span ) = matmul( S( i , af::span ) , B( af::span , af::span , i ) );
}

int main(int, char **)
{
    S = af::randn( fft_size , staves , c32 );

    gfor ( af::seq i, fft_size )
        S( i , af::span ) = af::randn( 1 , staves , c32 );

    B = af::randn( staves , beams , fft_size , af::dtype::c32 );
    R = af::constant( af::cfloat { 0 , 0 } , fft_size , beams );

    try
    {
        af::setDevice( 0 );
        af::info();

        double time = af::timeit(fn);

        printf( "Took %f secs.\n" , time );
    }
    catch (const af::exception &ex)
    {
        fprintf(stderr, "%s\n", ex.what());
        throw;
    }

    return 0;
}
πŸ‘︎ 7
πŸ’¬︎
πŸ‘€︎ u/ozancansell
πŸ“…︎ May 19 2021
🚨︎ report
Why exactly is matrix multiplication the way it is?

Just started recently with it at school.

Multiplying 2 regular numbers makes sense: 2*3 is merely three copies of 2 added together.

Adding 2 regular numbers? Makes sense too: 2+3 is incrementing 2 by three.

Matrix addition is also simple. You are basically incrementing the elements of the first matrix by the corresponding elements of the second matrix.

But the rules behind matrix multiplication confuse me. Why SHOULD the number of columns of the first matrix match the number of rows of the second matrix? Why do we multiply a row of the first matrix with each column of the second matrix? If 2*3 is three copies of 2 added together, how does that work for matrices? How does one add matrix A to itself matrix B times?

The rules seem very arbitrary to me, and I am not able to understand them. I did search on the internet, but a lot of the answers mentioned concepts and terms I don't understand and haven't been taught yet, like transformations.

Can anyone intuitively explain the concept and the rules of matrix multiplication to me? Why exactly is it done the way it is? Is there a specific reason, or is it just plain definition?
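One concrete way to motivate the rule without any new terminology: matrix multiplication is substitution of one set of linear expressions into another. Suppose

    z_1 = a_{11} y_1 + a_{12} y_2, \quad z_2 = a_{21} y_1 + a_{22} y_2,
    y_1 = b_{11} x_1 + b_{12} x_2, \quad y_2 = b_{21} x_1 + b_{22} x_2.

Substituting the y's into the z's gives

    z_1 = (a_{11} b_{11} + a_{12} b_{21}) x_1 + (a_{11} b_{12} + a_{12} b_{22}) x_2,

and the coefficients in parentheses are exactly the entries of AB: each one is a row of A dotted with a column of B. The column/row condition says that A must consume exactly as many y's as B produces, otherwise the substitution is not possible at all.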

πŸ‘︎ 61
πŸ’¬︎
πŸ‘€︎ u/The__BoomBox
πŸ“…︎ Mar 19 2021
🚨︎ report
Vector API (JEP 338) Benchmark Results for Matrix Multiplication, Image Convolution, and Image Thresholding. github.com/lessthanoptima…
πŸ‘︎ 69
πŸ’¬︎
πŸ‘€︎ u/lessthanoptimal
πŸ“…︎ Mar 22 2021
🚨︎ report
[Linear Algebra] Matrix multiplication

Given two matrices A and B, where A is an (m x n) matrix and B is an (n x p) matrix with columns [b_1, b_2, ... , b_p], can the product AB be thought of as the matrix with columns

[Ab_1, Ab_2, ... , Ab_p]?
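That column view agrees with the usual definition; a quick numerical sanity check (a NumPy sketch with arbitrary sizes):

    import numpy as np

    m, n, p = 4, 3, 5
    A = np.random.rand(m, n)
    B = np.random.rand(n, p)

    # Build AB one column at a time: column j of AB is A @ (column j of B)
    columns = [A @ B[:, j] for j in range(p)]
    assert np.allclose(A @ B, np.column_stack(columns))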

πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/Milk_No_Titties
πŸ“…︎ Jun 01 2021
🚨︎ report
I need to multiply the first row of a matrix by one number, and the other row by another number with matrix multiplication

I have this matrix

https://preview.redd.it/hw9vradmt6y61.png?width=106&format=png&auto=webp&s=bde91083ecca191748710a79ffe88a00776cf681

I need to end up with this matrix, i.e. multiply top row by 12 and bottom row by 20:

https://preview.redd.it/df14i8qst6y61.png?width=124&format=png&auto=webp&s=c34cc26a1d8951f22fb8d148985efc4c3e0663b4

How do I do this with matrix multiplication?
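One standard way to do it (sketched with generic entries, since the linked images hold the specific matrix): left-multiply by a diagonal matrix, which scales the rows,

    \begin{pmatrix} 12 & 0 \\ 0 & 20 \end{pmatrix}
    \begin{pmatrix} a & b \\ c & d \end{pmatrix}
    =
    \begin{pmatrix} 12a & 12b \\ 20c & 20d \end{pmatrix}.

Right-multiplying by the same diagonal matrix would scale the columns instead.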

πŸ‘︎ 2
πŸ’¬︎
πŸ‘€︎ u/hardstuck4ever
πŸ“…︎ May 10 2021
🚨︎ report
Matrix Multiplication Inches Closer to Mythic Goal quantamagazine.org/mathem…
πŸ‘︎ 25
πŸ’¬︎
πŸ‘€︎ u/Sampo
πŸ“…︎ Mar 23 2021
🚨︎ report
Auto-vectorization of Matrix-Vector Multiplication

I am struggling to vectorize an inner loop in a matrix vector multiplication. The initial column-wise multiplication works well. However, I need a sparse version where I can skip 16x1 sub-blocks (i.e. the inner loop).

I suppose I am running into a bunch of bounds checks? Due to array_chunks, it should be guaranteed that the inner loop operates on contiguous 16x1 blocks. Godbolt: https://rust.godbolt.org/z/fzn6n995d

I also tried a const generics version, which does not produce any assembly. Does anyone have an idea why?

Edit: as u/WafflesAreDangerous suggested, adding a main() instantiates it. It generates no vectorized code either, even though there shouldn't be any bounds checking since the sizes are known at compile time. https://rust.godbolt.org/z/MEovzc7sc

Edit 2: The C version also does not do the best job of vectorizing; however, it (obviously) does no bounds checking: https://c.godbolt.org/z/4cPeacGqb

πŸ‘︎ 20
πŸ’¬︎
πŸ‘€︎ u/rikorose
πŸ“…︎ Mar 30 2021
🚨︎ report
Comedy gold: Redditor proves that Craig doesn't understand high school level Matrix math by showing he incorrectly thinks Matrix multiplication is commutative. BSVer who's been lying about having credentials in math argues that Craig's math is right. reddit.com/r/bsv/comments…
πŸ‘︎ 20
πŸ’¬︎
πŸ‘€︎ u/Zectro
πŸ“…︎ Mar 31 2021
🚨︎ report
Uh oh... We've hit a new level, the Matrix level. OctoKitty just invented the Multiplication Potion, currently not for sale...
πŸ‘︎ 12
πŸ’¬︎
πŸ‘€︎ u/CequezHatesMortis
πŸ“…︎ Feb 28 2021
🚨︎ report
Suppose {v1, v2, ..., vn} is a basis for R^n. Let A be an invertible nxn matrix. Is A*{v1, v2, ..., vn} still a basis for R^n? What if A is not invertible? Does the multiplication still form a basis? Please explain the reasoning.
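A sketch of the standard argument: suppose A is invertible and c_1 (A v_1) + ... + c_n (A v_n) = 0. Then

    A(c_1 v_1 + \cdots + c_n v_n) = 0 \;\Rightarrow\; c_1 v_1 + \cdots + c_n v_n = 0 \;\Rightarrow\; c_1 = \cdots = c_n = 0,

using invertibility for the first implication and the independence of the v_i for the second. So {A v_1, ..., A v_n} is a set of n linearly independent vectors in R^n, hence a basis. If A is not invertible, some nonzero vector x = c_1 v_1 + ... + c_n v_n satisfies A x = 0 with not all c_i zero, so the vectors A v_i are linearly dependent and cannot form a basis.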
πŸ‘︎ 6
πŸ’¬︎
πŸ‘€︎ u/1500Calories
πŸ“…︎ Apr 01 2021
🚨︎ report
Pearson VUE Spreadsheet Matrix Multiplication

I'm trying to use the MMULT function on the Pearson spreadsheet, and it will only return one value in one cell instead of an array. I tried pressing Ctrl + Shift + Enter and it still returns one value. I have no problem getting it to work in Excel. Any help would be appreciated!

πŸ‘︎ 2
πŸ’¬︎
πŸ‘€︎ u/Reddituser5253
πŸ“…︎ Apr 18 2021
🚨︎ report
[2015 day 10][m4] optimizing by matrix multiplication

I saw this hint in the 2015 day 10 megathread about using Conway's Cosmological Theorem to determine the length of N iterations of the look-and-say algorithm (yes, the same Conway who invented the Game of Life that appears in many other AoC puzzles). But if that comment had any code to copy from, I could not find it; it was just the high-level overview. So I got to re-discover how it worked for myself, and indeed it made a huge difference in performance!

My original implementation did what most solutions did: iterate over every single character to look for the repetitions to produce the string for the next round. And the fact that the string gets ~30% larger each round shows: computing just part 1 took 6.6s, computing part 2 extended the time to 1m30s, and then my code took another 30s just counting the length of the 2 computed strings (2m5s total runtime). Using m4's trace functionality, I determined that my code performed 37,990,946 ifdef(), 38,032,571 ifelse(), 44,980,837 pushdef(), and 8,017,641 incr() calls, plus 2 very costly len(). And trying to expand the solution to 60 iterations would be that much more painful.

But my new version is able to compute answers without ever expanding the string in memory! As Conway noted, you can set up a 92-element recurrence relationship matrix, where each row represents an element (a substring whose expansion will never overlap with the rest of the answer), and each column represents which other elements will be present after the expansion of the element. All the inputs I've seen happen to be 10-

... keep reading on reddit ➡
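To make the recurrence idea concrete, here is a toy sketch (a made-up two-element system, not Conway's actual 92 elements):

    import numpy as np

    # Toy rules: element A expands to [A, B]; element B expands to [A].
    M = np.array([[1, 1],    # row A: one A and one B after expansion
                  [1, 0]])   # row B: one A after expansion
    lengths = np.array([2, 3])   # hypothetical character lengths of A and B
    counts = np.array([1, 0])    # start with a single A

    for _ in range(40):          # 40 look-and-say rounds (part 1's count)
        counts = counts @ M      # element counts after one more round

    print(counts @ lengths)      # total string length, never built in memory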

πŸ‘︎ 15
πŸ’¬︎
πŸ‘€︎ u/e_blake
πŸ“…︎ Mar 28 2021
🚨︎ report
[Year 11: Mathematics] Can anyone explain matrix multiplication between two tensors?

If I have two tensors, say:

[[1,2,3,4,5],[6,7,8,9,10]] = matrix

matrix * matrix = x

How would I multiply this/ how does the process work to find a value for x?

πŸ‘︎ 4
πŸ’¬︎
πŸ‘€︎ u/OscarBeresford
πŸ“…︎ Mar 31 2021
🚨︎ report
Vector API (JEP 338) is about 1.7x faster: Benchmark Results for Matrix Multiplication, Image Convolution, and Image Thresholding. github.com/lessthanoptima…
πŸ‘︎ 4
πŸ’¬︎
πŸ‘€︎ u/lessthanoptimal
πŸ“…︎ Mar 22 2021
🚨︎ report
Tool for visualizing matrix-vector multiplication with color values desmos.com/calculator/gvu…
πŸ‘︎ 4
πŸ’¬︎
πŸ‘€︎ u/hat-tippe
πŸ“…︎ Apr 03 2021
🚨︎ report
Any ideas what the problem could be? I am very bad at math/geometry. This was achieved by multiplying and adding distances and direction vectors, since transform matrix multiplication is way too hard for me. v.redd.it/uzrs9g2zvnk61
πŸ‘︎ 8
πŸ’¬︎
πŸ‘€︎ u/thr3rd
πŸ“…︎ Mar 02 2021
🚨︎ report
Intuitive way to write out Matrix Multiplication

I was never taught to write out matrix multiplication this way (maybe others have been), but I came up with it independently recently when I was getting back into linear algebra. The key is that the top of the first matrix and the left side of the second matrix need to form a *square*. Writing out matrix multiplications like this makes them way more intuitive, including immediately knowing the dimensions of your output matrix, which is really useful in practice. It is also easy to extend to multiplying many matrices in sequence, as well as left multiplication. Hopefully this helps you out as much as it did me: https://blogs.ams.org/mathgradblog/2015/10/19/matrix-multiplication-easy/

πŸ‘︎ 69
πŸ’¬︎
πŸ‘€︎ u/disgolf
πŸ“…︎ Nov 19 2020
🚨︎ report
Question about Matrix Multiplication

I often wonder about the design of certain conventions in math, why they are the way they are.

For instance, with matrix multiplication, if you have a product of two matrices AB, why did they decide to make it the dot product of B's columns with A's rows?

If you transpose B, then the output is simply the dot product of each corresponding row. Doesn't that seem simpler? Why complicate things?

πŸ‘︎ 2
πŸ’¬︎
πŸ‘€︎ u/Microwave_on_HIGH
πŸ“…︎ Feb 20 2021
🚨︎ report
Matrix Multiplication Inches Closer to Mythic Goal quantamagazine.org/mathem…
πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/qznc_bot2
πŸ“…︎ Mar 23 2021
🚨︎ report
Matrix Multiplication Inches Closer to Mythic Goal quantamagazine.org/mathem…
πŸ‘︎ 3
πŸ’¬︎
πŸ‘€︎ u/I_did_dit
πŸ“…︎ Mar 23 2021
🚨︎ report
Matrix multiplication
πŸ‘︎ 121
πŸ’¬︎
πŸ‘€︎ u/peachesonsteam
πŸ“…︎ Oct 07 2020
🚨︎ report
[2020 Day 10 part 2] [Python] Number of paths by doing matrix multiplications

Well, it's probably not the most optimal way (especially if you're aiming for performance on bigger inputs), but I think it's pretty neat. The idea is as follows:

  • The relationship between different adapters can be represented as a directed graph, where the adapters are the nodes and an edge exists from adapter A to B if you can plug B after A, that is, if B - A <= 3.
  • The problem guarantees that whenever an edge A->B exists, B > A. This implies that we can never go "backwards" in the graph to a lower rated adapter.
  • The previous point guarantees that any path from 0 to the last adapter is a valid solution.

Then, asking for the number of possible adapter combinations is the same as asking for the number of distinct paths from the node 0 to the node representing the last adapter. This is where adjacency matrices come into play. An adjacency matrix M is a NxN square matrix where N is the number of nodes in the graph. The value for any element M(i, j) in the matrix is 1 if there exists an edge connecting the node i to the node j, and 0 otherwise.

Now, adjacency matrices have a very neat property: if you compute M^k, then M^k(i, j) becomes the number of distinct walks of length k between the nodes i and j. Since we cannot go "backwards", and since we have no cycles in this graph, this is the same as the number of paths, which in turn is the same as the number of valid adapter combinations that include k adapters. We can repeat this process up to the total number of adapters to obtain all possible combinations:

    import numpy as np

    # `lines` is assumed to hold the sorted adapter ratings with the 0-jolt outlet prepended
    n_lines = len(lines)

    # Adjacency matrix: edge from i to j if adapter j can be plugged in after adapter i
    f = lambda i, j: j > i and lines[j] - lines[i] <= 3
    m = np.fromfunction(np.vectorize(f), (n_lines, n_lines), dtype=int).astype(int)
    aux = np.identity(n_lines)

    # Sum the number of paths of every length from node 0 to the last node
    sol_part_2 = 0
    for _ in range(n_lines):
        aux = aux @ m
        sol_part_2 += aux[0, n_lines - 1]

tl;dr: you can solve day 10 part 2 by building an adjacency matrix, multiplying it by itself N times, and adding up the (0, N) element, where N is the number of adapters.

πŸ‘︎ 13
πŸ’¬︎
πŸ‘€︎ u/Coffee_Doggo
πŸ“…︎ Dec 11 2020
🚨︎ report
