Somesh Misra / ERP.ai @MathproBro

chief researcher at https://t.co/85QLNI0SE9 | working at the intersection of business processes, neural network topologies & machine learning
erp.ai
San Francisco, CA
Joined February 2013

Tweets: 2K
Followers: 830
Following: 270
Likes: 1K

Timothy Nguyen @IAmTimNguyen

4 weeks ago

Mathematics as a field is going to have to reorient itself in light of powerful AI. But a slight pushback to Gowers's comment: "If LLMs are at the point where they can solve 'gentle problems', ...the lower bound for contributing to mathematics will now be to prove something that LLMs can’t prove, rather than simply to prove something that nobody has proved up to now and that at least somebody finds interesting."

Mathematics is infinite and thus inexhaustible. By having powerful AIs that can do heavy lifting, more of the burden is shifted towards taste and asking the right question. It becomes possible to discover something everyone else missed by looking in the right place. In mathematical physics, for instance, an Einstein inspired by the equivalence principle might not have to toil for a decade to invent general relativity, but could have equations proposed, their solutions found, and scenarios validated as limits of Newtonian physics.

Rather than raising the bar for problem-solving, AI has opened up contributing to mathematics to ideation and generation.

Timothy Gowers @wtgowers

4 weeks ago

But if AI mathematics continues to progress at anything like its current rate -- which is what I expect to happen -- then we will face a crisis very soon, and mathematics departments, who owe a duty of care to their students, should be urgently preparing for it.


Somesh Misra / ERP.ai @MathproBro

2 months ago

@xuanalogue looked at your CLIPS paper, so yes, an AI that truly infers a student's hidden goals and epistemic state might enable persistence instead of enabling shortcuts. :)


sphinx @protosphinx

3 months ago

sarvam is doing some phenomenal work. seeing positive commentary on r/locallama too

Pratyush Kumar @pratykumar

3 months ago

📢 Open-sourcing the Sarvam 30B and 105B models! Trained from scratch with all data, model research and inference optimisation done in-house, these models punch above their weight in most global benchmarks plus excel in Indian languages. Get the weights at Hugging Face and


Somesh Misra / ERP.ai @MathproBro

3 months ago

Paper: “Demystifying Oversmoothing in Attention-Based Graph Neural Networks” (NeurIPS 2023, spotlight)
By Xinyi Wu, Amir Ajorlou, Zihui Wu & @jababi at MIT/Caltech.

Key move: they model attention-based GNNs as nonlinear time-varying dynamical systems and use joint spectral radius theory to prove oversmoothing is inevitable for GCNs, GATs, and graph transformers.

Covers ReLU, LeakyReLU, GELU, SiLU. No architectural trick escapes it. The only way out is rethinking how depth is applied.

📄 arxiv.org/abs/2305.16102


Somesh Misra / ERP.ai @MathproBro

3 months ago

Everyone thought attention would solve oversmoothing in GNNs. It doesn’t. It can’t. Rigorous proof: expressive power in attention-based GNNs collapses exponentially with depth. GATs, graph transformers - none are immune. The real insight? Depth shouldn’t be uniform. A boundary node sitting between two communities needs 2 layers. An interior node in a dense cluster might need 10. Treating them the same is the actual problem. Structure should dictate depth. Not the other way around.
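The collapse is easy to see in a toy setting. What follows is a minimal sketch, not the paper's construction: it assumes a hypothetical 6-node graph (two triangles joined by one edge), scalar node features, and fixed mean aggregation with self-loops standing in for learned attention. The gap between the largest and smallest node feature shrinks layer after layer, so deep stacks end up assigning nearly the same value to both communities.

```python
# Toy illustration of oversmoothing. The graph, the features and the
# mean-aggregation rule are assumptions made for this sketch only.
EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
N = 6


def neighbours(n, edges):
    """Adjacency sets with a self-loop on every node."""
    adj = {i: {i} for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def propagate(x, adj):
    """One layer: each node takes the mean of itself and its neighbours."""
    return [sum(x[j] for j in adj[i]) / len(adj[i]) for i in range(len(x))]


def spread(x):
    """Gap between the largest and smallest node feature."""
    return max(x) - min(x)


def run(depth):
    """Return the feature spread before each layer and after the last one."""
    adj = neighbours(N, EDGES)
    x = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]  # one value per community
    spreads = [spread(x)]
    for _ in range(depth):
        x = propagate(x, adj)
        spreads.append(spread(x))
    return spreads


spreads = run(30)
for layer in (0, 2, 10, 30):
    print(f"layer {layer:2d}: spread = {spreads[layer]:.6f}")
```

The same loop with a per-node depth (stopping early for node 2 and node 3, the boundary pair) is one way to experiment with the "structure should dictate depth" idea.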


Somesh Misra / ERP.ai @MathproBro

4 months ago

This nomenclature always confused me! NP-hard sounds like a subset of NP, but NP is the class of problems whose solutions can be verified efficiently, while NP-hard means at least as hard as every problem in NP, and an NP-hard problem need not be in NP at all. Knuth suggested three names, "Herculean", "Formidable", and "Arduous", and sent out a poll to people in the theory community. One write-in suggestion was "Hard-Ass Problems" (Hard As Satisfiability). Bell Labs won with "NP-hard", and they've been confusing people ever since. The real NP-hard problem was naming NP-hard.
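The verify-versus-solve gap can be sketched with satisfiability itself. This is a small illustration under assumed inputs: the formula below is made up, and the exhaustive search is only the naive solver, not a claim about the best known algorithm. Checking a proposed assignment takes one pass over the clauses, while the search may try all 2^n assignments.

```python
from itertools import product

# CNF formula as a list of clauses. Literal k refers to variable |k| and is
# negated when k < 0. This particular formula is a hypothetical example.
FORMULA = [[1, -2], [2, 3], [-1, -3], [-2, -3]]
NUM_VARS = 3


def verify(formula, assignment):
    """Check a candidate assignment in time linear in the formula size."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in formula
    )


def brute_force_solve(formula, num_vars):
    """Naive search: try each of the 2**num_vars assignments in turn."""
    for bits in product([False, True], repeat=num_vars):
        assignment = {i + 1: bits[i] for i in range(num_vars)}
        if verify(formula, assignment):
            return assignment
    return None


solution = brute_force_solve(FORMULA, NUM_VARS)
print("satisfying assignment:", solution)
print("verified:", solution is not None and verify(FORMULA, solution))
```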


Somesh Misra / ERP.ai @MathproBro

4 months ago

Underlying reason: Continuity and symmetry induce equivalence classes over inputs. Transformers collapse nearby sequences into the same representation orbit. Perplexity is invariant on these orbits. Correctness is not. This was never about Perplexity the company. It is about algebra, group actions, and quotient spaces.


Somesh Misra / ERP.ai @MathproBro

4 months ago

Paper link: arxiv.org/abs/2601.22950
cc @PetarV_93
Thank you for formalizing something many of us felt but could not prove.


Somesh Misra / ERP.ai @MathproBro

4 months ago

Perplexity is not always right. It can appear confident and rigorous, and it can score extremely well by its own metric, while still producing an incorrect prediction. This is not a bug or a training artifact. The result comes from the paper “Perplexity Cannot Always Tell Right from Wrong”.
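For readers who want the metric spelled out: perplexity is the exponential of the mean negative log-probability a model assigns to a sequence. The sketch below uses invented token probabilities (they come from no real model, and this is not the paper's argument) to show that the number only reflects how confident the model is, not whether the sequence is correct.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability over the tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Hypothetical per-token probabilities for two completions of one prompt.
# The numbers are invented purely for illustration.
confident_but_wrong = [0.95, 0.90, 0.92]  # fluent, incorrect answer
hesitant_but_right = [0.60, 0.40, 0.55]   # correct answer

ppl_wrong = perplexity(confident_but_wrong)
ppl_right = perplexity(hesitant_but_right)
print(f"wrong answer:   perplexity = {ppl_wrong:.3f}")
print(f"correct answer: perplexity = {ppl_right:.3f}")
```

With these made-up numbers the incorrect completion gets the lower (better) perplexity, which is the failure mode the post describes.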


Somesh Misra / ERP.ai @MathproBro

4 months ago

This insight leads to a set of fundamental group theory based results. I have tried to characterize which forms of node-level memorization are inevitable in GNNs and which require symmetry breaking. Paper coming after review.


Somesh Misra / ERP.ai @MathproBro

4 months ago

Hot take: a lot of GNN memorization isn’t learned at all. It’s forced. Graph symmetry + training dynamics decide what a GNN can and cannot memorize, before data even enters the picture.
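One standard fact behind this kind of claim can be checked in a few lines: nodes that are swapped by a graph automorphism, and that start with identical features, receive identical embeddings from any shared-weight message-passing layer. The sketch below is only an illustration of that fact, not the forthcoming results; the path graph, the scalar `tanh` layer and the random weights are all assumptions.

```python
import math
import random

# Path graph 0 - 1 - 2 - 3. The reflection (0<->3, 1<->2) is an automorphism.
ADJ = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}


def layer(h, adj, w_self, w_nbr, bias):
    """Shared weights: h_i <- tanh(w_self*h_i + w_nbr*sum_j h_j + bias)."""
    return {
        i: math.tanh(w_self * h[i] + w_nbr * sum(h[j] for j in adj[i]) + bias)
        for i in adj
    }


def embed(adj, depth, seed):
    """Run `depth` layers with random (hypothetical) weights per layer."""
    rng = random.Random(seed)
    h = {i: 1.0 for i in adj}  # identical input features on every node
    for _ in range(depth):
        w_self, w_nbr, bias = (rng.uniform(-1, 1) for _ in range(3))
        h = layer(h, adj, w_self, w_nbr, bias)
    return h


h = embed(ADJ, depth=5, seed=0)
for i in sorted(h):
    print(f"node {i}: {h[i]:+.6f}")
```

Whatever the weights, nodes 0 and 3 (and nodes 1 and 2) come out equal, so no amount of training lets this model fit different labels on them unless something breaks the symmetry, for example distinct input features.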


Somesh Misra / ERP.ai @MathproBro

5 months ago

Three claims/theorems about deep learning that seem difficult to disprove and even harder to prove:

A) Gradient descent does more than minimize loss. It reshapes geometry by collapsing directions that are irrelevant to the task (gradient flow induces anisotropic contraction in the pullback metric, with decay along directions orthogonal to the loss gradient).

B) Symmetry does not need to be imposed. When data and objectives are invariant, training dynamics tend to uncover quotient structure implicitly (optimization trajectories concentrate on equivalence classes induced by approximate group orbits, even without architectural equivariance).

C) Memorization is not storage. It is the emergence of extremely sharp decision geometry confined to negligible-volume regions (interpolation is achieved via high-curvature decision boundaries localized to sets of vanishing measure in input space).

These are not easy theorems. But they feel like the right ones to chase. Genuinely looking for advice, counterexamples, or references from people thinking deeply about this: @levie_ron @kamalikac @rsalakhu @ok1zjf @neelnanda5 @mmbronstein


Somesh Misra / ERP.ai @MathproBro

5 months ago

A doubly stochastic matrix only redistributes values. It cannot amplify them or destroy them. Geometrically, it is a soft mixture of permutations. It shuffles and mixes, but conserves total signal. Identity is one extreme case of this. So mHC does not abandon the identity idea. It generalizes it. Identity becomes a stable geometric object instead of a single point. That is the breakthrough: deep learning stability enforced by geometry, not tricks.
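The "soft mixture of permutations" picture can be made concrete. In this minimal sketch, a 3x3 doubly stochastic matrix is built as a random convex combination of all six permutation matrices; the size, the weights and the test vector are arbitrary choices for illustration and are not taken from the mHC paper. Applying it to a vector keeps the total unchanged and keeps every output entry between the smallest and largest input.

```python
import random
from itertools import permutations


def permutation_matrix(perm):
    """Row i has a single 1 in column perm[i]."""
    n = len(perm)
    return [[1.0 if perm[i] == j else 0.0 for j in range(n)] for i in range(n)]


def convex_mix(mats, weights):
    """Weighted sum of matrices; weights are non-negative and sum to 1."""
    n = len(mats[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, mats)) for j in range(n)]
        for i in range(n)
    ]


def matvec(m, x):
    return [sum(m[i][j] * x[j] for j in range(len(x))) for i in range(len(m))]


rng = random.Random(0)
perms = list(permutations(range(3)))
raw = [rng.random() for _ in perms]
weights = [r / sum(raw) for r in raw]
D = convex_mix([permutation_matrix(p) for p in perms], weights)

row_sums = [sum(row) for row in D]
col_sums = [sum(D[i][j] for i in range(3)) for j in range(3)]
x = [3.0, -1.0, 4.0]
y = matvec(D, x)

print("row sums:", [round(s, 6) for s in row_sums])
print("col sums:", [round(s, 6) for s in col_sums])
print("total before:", sum(x), "total after:", round(sum(y), 6))
```

Putting all the weight on the identity permutation recovers the plain residual connection, which is the sense in which identity is one extreme case.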


Somesh Misra / ERP.ai @MathproBro

5 months ago

That learned matrix gets applied again and again across layers. Now depth is no longer identity plus correction. It is repeated application of an unconstrained matrix. We are back to the original instability problem. mHC fixes this by using geometry. Instead of letting any learned matrix stand in for the identity, it restricts that matrix to a special space: the doubly stochastic matrices. No math needed. Here is the intuition.


Somesh Misra / ERP.ai @MathproBro

5 months ago

The DeepSeek mHC paper is a real breakthrough, and the reason is geometric, not architectural.

Early neural networks were just repeated matrix multiplications: x <- W x. Depth was unstable. ResNets changed one line: x <- x + F(x), which linearizes to x <- (I + W)x. That single identity term is what made deep learning scale.

Hyper-Connections broke this by replacing identity with a learned matrix, turning depth back into unconstrained matrix products. mHC fixes this in a principled way. Instead of identity or an arbitrary matrix, mHC uses a doubly stochastic one.

Doubly stochastic matrices form the Birkhoff polytope. They are convex combinations of permutations. Geometrically, the residual stream undergoes conservative transport and mixing, not amplification or decay. Identity is just one extreme point of this space. Under composition, stability is preserved.

mHC does not abandon identity. It generalizes it into a stable geometric object. This is not an engineering trick. It is linear algebra and geometry doing the real work.
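The contrast between unconstrained and doubly stochastic products can be tried numerically. The sketch below rests on several assumptions: 4x4 matrices, 50 layers, uniformly random entries, and Sinkhorn-style alternating row/column normalization as a convenient way to get (approximately) doubly stochastic matrices. The thread does not say how mHC parameterizes its matrices, so treat `sinkhorn` here as a stand-in, not the paper's method.

```python
import random

N = 4       # matrix size (arbitrary)
DEPTH = 50  # number of stacked layers (arbitrary)


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]


def random_matrix(rng, low, high):
    return [[rng.uniform(low, high) for _ in range(N)] for _ in range(N)]


def sinkhorn(m, iters=200):
    """Alternately normalize rows and columns of a positive matrix."""
    m = [row[:] for row in m]
    for _ in range(iters):
        m = [[v / sum(row) for v in row] for row in m]
        col = [sum(m[i][j] for i in range(N)) for j in range(N)]
        m = [[m[i][j] / col[j] for j in range(N)] for i in range(N)]
    return m


def doubly_stochastic(rng):
    return sinkhorn(random_matrix(rng, 0.1, 1.0))


def product_of_layers(make_layer, depth, rng):
    """Multiply `depth` freshly sampled layer matrices together."""
    out = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
    for _ in range(depth):
        out = matmul(make_layer(rng), out)
    return out


def max_abs(m):
    return max(abs(v) for row in m for v in row)


rng = random.Random(0)
exploding = product_of_layers(lambda r: random_matrix(r, -2.0, 2.0), DEPTH, rng)
vanishing = product_of_layers(lambda r: random_matrix(r, -0.2, 0.2), DEPTH, rng)
stable = product_of_layers(doubly_stochastic, DEPTH, rng)

print(f"large unconstrained entries: max |entry| = {max_abs(exploding):.3e}")
print(f"small unconstrained entries: max |entry| = {max_abs(vanishing):.3e}")
print(f"doubly stochastic layers:    max |entry| = {max_abs(stable):.3e}")
print("row sums of the stable product:", [round(sum(r), 6) for r in stable])
```

Depending on scale, the unconstrained product blows up or dies out, while the doubly stochastic product stays a doubly stochastic matrix with entries in [0, 1], which is the closure-under-composition property the post relies on.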