Little did you know, LC-style questions are never about grinding LC. Algorithmic puzzles are one of the few legal ways of measuring a candidate's IQ without directly asking. Companies are looking for a way to hire smart people, so they rely on LC as a signal. It can be replaced with any similar signal as well (ranging from how many cats you can ship to the ISS to solving black hole physics).
I might buy that, except for how cheesy the actual questions are.
If you subscribed to the old "Daily Coding Problem" email list, you'd know. Those guys collected actual questions asked in interviews ca. 2010-2015, and sent them back out. About half were so poorly worded that interviewers couldn't possibly get anything out of them. Some of the questions required zero algorithmic thinking, or there was only one possible solution. Also, getting a flash of physical insight to solve a problem rarely happens when you're in a high-stress situation.
This is one of the most persistent myths in all of hiring. It is not unlawful to test IQ for white collar job candidates. Companies don't use IQ tests because they're not particularly effective, not because they'd get in trouble (reputation aside) for doing so. I don't believe anybody who (a) says stuff like this about Leetcode and (b) works professionally in this industry actually believes they could productively hire off an IQ test.
Some of the FAIR people moved to Thinky, and they also started doing encoder-free MM-LLMs. Now Google. This seems to be becoming a trend. It works at small scale, but the difficult part is scaling.
The standard approach for training MM-LLMs is to train the encoder first. There are O(2-10B) good images on the internet, and the encoder needs to see each image O(10-100) times; that is O(100T) tokens, which is more than the entire pre-training budget for most runs. That is the reason we train the encoder separately (a smaller model, 2B active vs. a 30B or 200B active LLM). There is nothing magical about training the encoder and LLM together; it is just more token-efficient to train the image modality first.
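For anyone who wants to sanity-check the O(100T) figure, here is a rough back-of-envelope sketch. The image counts and repeat counts are the ranges quoted above; the 256 tokens per image is purely my assumption for illustration (the comment doesn't give a number, and real tokenizers vary a lot), as is the helper name `image_token_budget`.

```python
def image_token_budget(num_images: int, views_per_image: int, tokens_per_image: int) -> int:
    """Total image tokens the encoder sees: images x repeats x tokens per image."""
    return num_images * views_per_image * tokens_per_image


# ASSUMPTION: 256 tokens per image; not stated in the thread, chosen only for illustration.
TOKENS_PER_IMAGE = 256

# Low end of the quoted ranges: 2B images, each seen 10 times.
low = image_token_budget(2_000_000_000, 10, TOKENS_PER_IMAGE)

# High end of the quoted ranges: 10B images, each seen 100 times.
high = image_token_budget(10_000_000_000, 100, TOKENS_PER_IMAGE)

print(f"low:  {low / 1e12:.2f}T tokens")
print(f"high: {high / 1e12:.2f}T tokens")
```

Under that assumption the budget lands somewhere between roughly 5T and 256T image tokens, so the upper end is in the same ballpark as the O(100T) claim. Change `TOKENS_PER_IMAGE` and the range scales linearly.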
In my experience, Opus 4.0 was fantastic, a major jump from 3.7. It was creative, super slow and expensive, and would sometimes forget what it was doing, but it was getting the job done.
With 4.1 they made it much faster, so a lot of infra improvements.
4.5 was when it could work on longer tasks and didn't make a lot of the obvious mistakes of 4.0. I think this was about the time Opus went mainstream and Anthropic's compute crisis began, so instead of making the model better they tried to optimize it to reduce cost.
4.6 was such a bad model. They switched to adaptive thinking and it had so many bugs. Poor API design, benchmaxxed, and poor real-world results. I switched back to 4.5.
4.7 they just fixed the bugs they added in 4.6. Better than 4.5.
It's just amusing reading all these posts with different viewpoints. Just in this thread there are multiple people saying 4.6 was so much better than 4.7 and that they switched back to 4.6.
I also find it amusing. I also heard a lot of "4.7 is garbage, everybody hates it". Shows you how important proper validation techniques are, not just gut feeling.
I gave 4.6 a miss and only recently switched from 4.5 to 4.7. I found that a particularly difficult task 4.5 struggled with (getting stuck in loops and trying to convince me the problem had been solved) was quite solvable with 4.7.
i'm not sure if i'm hallucinating, but i swear i had codex in the ChatGPT app from a long time ago (like the original codex on the web).
they added some new stuff, like remote control to wherever the desktop codex app is running, but these companies need to work much more on their press releases.
yeah, even on product lines that they kill (like Stadia) they usually do right by the user (e.g. they refunded everyone for both the hardware and the software people bought on the platform).
In my anecdotal experience, it is not. The same model, Opus, works better in 3P harnesses such as Factory Droid or Amp.
Claude Code, on the other hand, is the most subsidized one, both for consumers (through the Max subscription) and for enterprises (token discounts). It is also heavily optimized for cost, especially token caching and reduced thinking, at the expense of quality.