Official Blog

This blog is a compilation of thoughts I have had over the course of the past few months, based on and triggered by newspaper articles as well as comments from customers and partners. The vast majority of the blog will be about HPC and what is not working as well as it could. I'll include news and my opinions on suppliers to the HPC industry, industry outlooks, tech trends and economic projections if I feel comfortable with those. From time to time, I might bring up things within the broader framework of IT that annoy me, and of course I'll include news and my comments on IT security, privacy and authentication when appropriate.

I will try my best to keep it current and relevant.

FMS renames itself

Posted on by: Axel Kloth

The Flash Memory Summit, which for 18 years has presented the state of the art and advances in Flash Memory, has decided to rename itself to properly reflect its extended scope beyond Flash. Expect to see coverage of DRAM, Flash, Phase Change Memory and other novel main memory technologies, as well as new and updated mass storage technologies. While the acronym remains FMS, it now stands for The Future of Memory and Storage.

The old FMS URL still works and will be kept for a while, but FutureMemoryStorage.com is the new FMS URL.

Make sure to sign up to see the exciting new developments in The Future of Memory and Storage.

RISC-V Summit 2023

Posted on by: Axel Kloth

This year's RISC-V Summit at the Santa Clara Convention Center in California proved that RISC-V is not only here to stay, it is growing. The ecosystem around it is expanding at a nice pace as well. There are more companies offering RISC-V cores, processors, boards, computers, handhelds, laptops and even servers than last year. More compilers, debuggers and other tools and operating systems as well as hypervisors are available, and tools for the verification of ISA compliance as well as the CPU design itself are becoming mainstream.

What was surprising to me was that there was a good number of end users who showed interest, and that is what usually starts the commercial success of a platform and an ISA. While x86-64 is not going to go away any time soon with its installed base of software that can't easily be replaced, ARM is being pushed aside by RISC-V in several areas. IOT and IIOT are probably the biggest areas in which RISC-V has the potential to completely displace ARM, as neither x86-64 nor ARM have developed a stronghold there yet. Feature phones and possibly tablets might be next to adopt RISC-V and threaten ARM. The OSes now exist, and with Android on RISC-V a reality, cost-sensitive feature phones might switch to the RISC-V platform entirely. A preview of all of this was available at the Summit.

It may have been coincidence, but Silicon Angle reported that Arm’s stock sinks on lower guidance following first post-IPO earnings call. I had posted a blog entry not too long ago explaining why I thought the ARM IPO was overpriced and overhyped. Reality caught up very quickly with ARM and its leadership. To some degree, RISC-V has already impacted ARM and its stock price.

Large Language Models

Posted on by: Axel Kloth

As the name "Large Language Model" implies, generating an LLM requires large input data to feed into a Generative Pre-trained Transformer. The generation of an AI model for a LLM in the backend is a computationally hard and memory-intensive application, both in terms of bandwidth and latency requirements. CPUs do not have enough cores to effectively execute those functions, and GPGUS are neither general-purpose nor do they have an efficient and effective way to directly communicate with each other. As such, the direct interconnect is missing, and another portion that is missing is a scale-out port to connect many of them together at maximum bandwidth and minimum latency. These limitations are known as the memory wall, the von-Neumann bottleneck and the Harvard architecture limits.

We overcome all of them. That technology is now patent-pending.

Beyond Harvard CPU Architecture

Posted on by: Axel Kloth

For decades now, people have complained about the von-Neumann bottleneck (input – processing – output, with some instruction and temporary data I/O to and from memory). The only suggested solution was the Harvard architecture, which separates instruction and data I/O to and from memory. No progress has been made since.

We have developed a post-Harvard CPU architecture that does away with the scale-out limitations imposed by current processors and accelerators, including GPGPUs.

Our solution adds, among other items, a scale-out port to the CPU and accelerator core that allows direct connectivity between general-purpose CPU cores, accelerator cores, smart multi-homed memories, and secondary infrastructure such as peripherals. This allows us to separate I/O from peripherals and all of the aforementioned from any kind of Inter Processor Communication (IPC), and to optimize all communication channels accordingly.

Securing a Server

Posted on by: Axel Kloth

I must have poked into a hornet's nest with my two recent blog posts on BMCs and on DPUs. I got a whole bunch of angry emails in response. That is a good thing.

In short, neither a DPU nor a BMC alone can secure a server. It takes vastly more than that, and most importantly, the firmware and BIOS of both the server's host CPU and the BMC must be fully secured, encrypted and authenticated.

Let me recap how to secure a server these days. Unlike a decade ago, a server today contains a whole bunch of smart devices that have their own processors and boot code and BIOSes. As a result, all of them must be secured individually to secure the server as a whole. Here is an incomplete list: CPU(s); DPU(s); SAS, SATA and RAID Controllers if present; all accelerators including GPGPUs; the authentication processor in the Root of Trust coprocessor if present; the TPM or vTPM and of course the BMC. All of these have their own firmware and BIOS, so those must be protected against insertion of malicious code, and that only works if the code is encrypted to individual keys, not a manufacturer's key. In case of detected tampering, the device must not start up. Anything in the net user data path is critical and should be protected against the threats that usually affect data contents. The situation is different and even more crucial for the BMC, which is not in the net user data path, but access to it opens up literally everything in the server to an intruder. As such, the BMC must be secured as tightly as possible. Since the BMC is not in the net user data path, it can only protect itself and the OAM&P path of the boot CPU, nothing else. Those two items must never be confused.
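
To make the "every component must be checked before it may start" point concrete, here is a minimal Python sketch. It uses simple hash measurements against a golden manifest rather than the per-device encryption described above, and the component names, images and digests are invented for illustration; a real platform would keep the golden values and the check itself inside a Root of Trust or TPM, not in host software.

    # Minimal sketch: measure each component's firmware and compare it against
    # known-good values before allowing the component to start. Everything here
    # (names, images, the tampering) is synthetic and purely illustrative.
    import hashlib

    def measure(image: bytes) -> str:
        return hashlib.sha384(image).hexdigest()

    # Stand-in firmware images for some of the devices listed above.
    firmware = {
        "host_cpu_bios": b"\x55\xaa" * 4096,
        "bmc":           b"\x12\x34" * 4096,
        "dpu":           b"\xde\xad" * 4096,
        "raid_ctrl":     b"\xbe\xef" * 4096,
    }

    # Golden manifest, recorded when the known-good firmware was provisioned.
    golden = {name: measure(image) for name, image in firmware.items()}

    # Simulate tampering with one component's firmware.
    firmware["bmc"] = firmware["bmc"][:-2] + b"\x00\x00"

    for name, image in firmware.items():
        ok = measure(image) == golden[name]
        print(f"{name:>14}: {'ok' if ok else 'TAMPERED - do not start'}")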

ARM IPO

Posted on by: Axel Kloth

ARM is having an Initial Public Offering (again). This time around I am even less enthusiastic about it than I was at its first IPO. Back then, ARM was still Acorn RISC Machines. It was a novel RISC Instruction Set Architecture for low-performance devices (Apple tried it in the Newton MessagePad 700/710), and the ISA was somewhat acceptable. Nothing great, nothing terrible, but certainly not built to make any of the parts that make up a processor as simple as possible. The simpler and more straightforward the ISA, the simpler the instruction decoder and pipeline, and all subsequent units. That was not ARM, but it was acceptable. Today, the ARM ISA has become too big. A big and complex ISA makes every unit in the processor overly complex, power-hungry, prone to attacks and less conducive to performance improvements.

Another drawback is that licensing anything from ARM has become too expensive. There is a good chance that even if you license only the core and do not need anything else, ARM will push you to take whatever else is in its portfolio on top of the processor core IP. So even if you have other ASIC components that were developed in-house and perform better in the application your customer targets than anything in ARM's portfolio, ARM might still force you to license and use its IP.

Also, if the performance of the ARM core is not what you need, you cannot touch it unless you have a very expensive architecture license. And even then, ARM may sue you if you have licensed versions A through D, and you buy a company with a valid license to E, and you now incorporate version E into your products...

I do not like ARM's corporate leadership, and neither do I like its current owner, SoftBank, and its Vision Fund.

I will absolutely stick with RISC-V for all of those reasons. I can do whatever I want and implement it any way I want, and for as long as I can show that I am in compliance with the RISC-V ISA, my implementation can be certified.

I will also not invest in ARM, and I will make sure that none of my portfolios buy into ARM's IPO. Even if Apple, Intel, Qualcomm and TSMC invest in ARM's IPO, I will not touch it.

The OpenROAD Project

Posted on by: Axel Kloth

I have had a lengthy discussion with the leadership team at Precision Innovations with regard to The OpenROAD Project, and I came away incredibly impressed. Not only are they providing an entire open-source EDA toolset and related libraries for most planar-transistor semiconductor design efforts, they are clearly looking into the future of ASIC design. While the commercial tools may be able to cover more processes down to the FinFET and Gate-All-Around nodes, the OpenROAD team has identified many issues that the average ASIC design engineer faces, and has set out to solve them. Multi-die and MCM are already included in the flows, and integration of analog and mixed-signal into digital logic design is straightforward. What impressed me even more is the willingness and ability of the team and its engineers to react quickly to suggestions. We will use that toolset for all of our proof-of-concept designs, which will also allow us to compartmentalize our designs and assist in better design verification.

Definition of a DPU

Posted on by: Axel Kloth

I have received a number of emails indicating that the function of a DPU is not entirely clear. Here is a brief overview of what a DPU does, and why it is of importance. DPU is an acronym for Data Processing Unit, but that is very generic and does not say much. Generally, a DPU is a smart Network Interface Card (NIC) that allows certain traffic to be offloaded from the CPU. In the early days, smart NICs offloaded only some TCP processing from the server CPU, while all UDP traffic still needed to be processed by the server CPU. That proved not to be very useful, and as such smart NICs never really took off. That changed when the offload was augmented by allowing the processor on the smart NIC to autonomously transfer data from the host CPU's memory to the recipient's smart NIC, which in turn transferred the data to the receiving side's CPU's memory. That reduced the server CPU's task to directing the DPU to transfer data from region A in its memory to region B in a destination server's memory. In other words, the server CPU had to send only a very short command with very few parameters to the DPU, and the DPU would then autonomously execute the transfer without further input from the CPU. This is called DMA and, when the initiator requests data from the other side, RDMA. Once smart NICs had evolved into these feature-rich devices, they became useful and indeed offloaded the host CPU. With the cost of a DPU quickly dropping, the additional computational and data transfer capabilities make technical and financial sense. On top of that, most DPUs have their own memory and Operating System, and they can be used to filter data such that inbound threats are recognized and terminated. The threat databases can be automatically updated and synchronized, and even AI can be used to identify new threat patterns, all without increasing the computational load on the host CPU. That is only possible because the DPU is in the net user data path.
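
The "very short command with very few parameters" can be pictured as a small descriptor that the host CPU writes and the DPU then executes on its own. The Python sketch below is purely illustrative; the field names, sizes and opcodes are invented for this example and do not correspond to any specific DPU, NIC or RDMA verbs API.

    # Illustration only: the kind of short descriptor a host CPU might hand to
    # a DPU to kick off an (R)DMA transfer. Field names and sizes are invented
    # for this sketch.
    from dataclasses import dataclass
    import struct

    @dataclass
    class DmaDescriptor:
        opcode: int        # e.g. 0 = local DMA write, 1 = RDMA read from remote
        src_addr: int      # address of region A in local memory
        dst_addr: int      # address of region B in the remote server's memory
        length: int        # number of bytes to move
        remote_node: int   # fabric/NIC identifier of the destination server
        key: int           # memory protection key for the remote region

        def pack(self) -> bytes:
            # A command of a few tens of bytes is all the host CPU produces;
            # the DPU executes the transfer autonomously from here on.
            return struct.pack("<BQQIIQ", self.opcode, self.src_addr,
                               self.dst_addr, self.length, self.remote_node,
                               self.key)

    cmd = DmaDescriptor(opcode=1, src_addr=0x1000_0000, dst_addr=0x2000_0000,
                        length=1 << 20, remote_node=42, key=0xDEADBEEF)
    print(len(cmd.pack()), "bytes of command for a 1 MiB transfer")

The point is that a few tens of bytes of command are enough to move megabytes or gigabytes of data, which is what makes the offload worthwhile.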

All major CPU vendors are looking at DPUs, and right now NVIDIA with its acquisition of Mellanox and AMD with its Pensando acquisition have DPUs for sale or are close to offering them. All current DPUs are PCIe-attached and as such are limited in their performance by the bandwidth available on PCIe and its high latency. CXL will not change that as CXL uses PCIe as the underlying infrastructure.

Our Server-on-a-Chip has a built-in DPU as well. Our DPU uses as much hardware offload (DMA Controllers, cryptographic accelerators, authentication engines, and a key management unit) as possible, and all functions that must remain on a programmable CPU core run on a RISC-V core set with hardware support for virtualization, using a highly modified version of OPNsense. Since our DPU is on-die on the Server-on-a-Chip, the PCIe limitations do not apply.

Flash Memory Summit 2023

Posted on by: Axel Kloth

This year's Flash Memory Summit at the Santa Clara Convention Center was again an in-person event, like last year. It ran from 2023-08-08 to 2023-08-10, and in my opinion, it was the most well-executed FMS ever. Kudos to the organizers. There were some obvious trends:

  • AI is picking up steam
  • Because AI – particularly generative AI during training – needs so much memory, memory size matters
  • CXL is coming. I don't like it for a variety of reasons, but everyone else thinks it is the second-hottest trend after AI (and due to AI)
  • ccNUMA may be back because of AI
  • Unified memory (DRAM and mass storage may converge) – I disagree with that, but that is what I heard many times
  • SATA and SAS spinning hard disks are dead. Long live the dead. WD and Seagate still make spinning disks
  • SATA and SAS SSDs are dead as they migrate to faster M.2 and M.3 and other PCIe-attached interfaces
  • Spinning disks are not really dead as they are relegated to cold or warm storage, instead of tape
  • Tape is not dead either. Mass storage is tiered
  • Moore's Law is dead, at least in 2D. However, while there is no real 3D chipmaking yet, 2.5D helps out. All Flash manufacturers can make Flash stacks with over 200 layers. Moore's Law is alive and kicking. I have said that in my blog for the past few years
  • We generate too much data. Humans do, and now AI will add to it, and more than ever, meaning that we either sift through it, or throw it away, or store it until we can sort through it. Which means that we will need all storage that anyone can make
  • Optane is still dead, but the memory that is PCM and was not at the same time – sort of like Schroedinger's cat – may be dead and alive at the same time, simply because Flash is not fast enough, and the endurance is still only a maximum of 3000 to 5000 cycles written per cell. Intel wrote Optane PCM off, and SK hynix has renamed the former Intel SSD business to Solidigm, and while Solidigm is now selling Flash, it's clear that they are looking at alternatives to Flash
  • Next year, FMS will be even better. Maybe it will even change its name, as it certainly is not only about Flash any more.

    Baseboard Management Controllers (BMCs)

    Posted on by: Axel Kloth

    Most servers have an integrated Baseboard Management Controller (BMC) on the mainboard. Its primary function is to assist in managing the server. This management includes updating the Firmware of the server, shutting it down and restarting it remotely, and interfacing with the Trusted Platform Module (TPM) and the Root of Trust coprocessor. As such, the BMC oftentimes has its own Network Interface Card (NIC) so that its function is uninhibited by the NIC of the server. In other words, even if the server has crashed and its Operating System is non-responsive, the BMC will save the day and be able to help the system administrator restart the server remotely. It is therefore imperative that the BMC itself is very well secured against attacks, and that its NIC is independent of the server's NICs. The connection to the TPM allows it to use shared secrets so that Firmware updates can be authenticated. It may even contain a virtual TPM, in which case a physical TPM is not needed. The same is true for the Root of Trust coprocessor. However, the BMC is connected through PCIe to the server's host processor, and that connection can be snooped on. Also, the host will boot from its own SPI Flash, independent of whether the BMC has access to it to update the Firmware. In other words, the BMC cannot guarantee the validity and the authenticity of the host's Firmware. Rewriting the Firmware with a hardware SPI Flash tool or using malicious Software to install malicious Firmware remains possible. Preventing this is beyond the control of the BMC. On top of that, for the reasons mentioned above, the BMC's NIC is not in the data path of the host processor, and as such the BMC cannot inspect the traffic on the host's NICs. A Data Processing Unit (a fancy term for a smart offload NIC with DMA and RDMA capabilities) can do that, and as such would be able to identify and block malicious traffic on the network, in both ingress and egress directions. A BMC cannot do that.

    The BMC is as decoupled as possible from the net user data, and most decent system administrators even put the net-user-data-facing NICs on a different VLAN from any other devices that are used for operation, administration, maintenance and provisioning (OAM&P). If all net user data is on VLAN1, and all OAM&P traffic is on VLAN2, then even someone on the LAN – including an intruder – will not see the BMC.

    AI will need TB-level DRAM

    Posted on by: Axel Kloth

    Like pretty much everyone else, I have played with ChatGPT and a few other AI tools, and the more I did, the more I recognized that the convergence of Hardware requirements for AI, ML and traditional HPC is upon us. HPC has always needed very large amounts of main memory, but it surprised me to see that both the training and the inference side of AI take considerable amounts of DRAM.

    Any kind of model creation maxed out our engineering server – and that is not a small machine: it has 64 cores and 512 GB of DRAM. On the inference side the load was a bit lighter, but assuming that if something works, the public will start using it in large numbers, I can foresee that a 16 GB laptop will not do in the future. I think that DRAM in a laptop will soon have to be 64 GB, and any servers on the training side will have to be TB-level DRAM machines. I would not be surprised if two or three years from now, we see 4 TB servers on the lower end, and larger ones with 16 to 64 TB worth of DRAM in them. Considering that a lot of power is needed to run the DRAM protocol – SSTL-2 – that might be the next barrier to bring down.
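
    A rough sizing sketch in Python shows why TB-level DRAM on the training side is plausible. The rule of thumb used here - roughly 16 to 20 bytes per parameter for weights, gradients and optimizer state in mixed-precision training, before activations are counted - is a common estimate, not a measurement, and the model sizes are picked purely for illustration.

        # Rough sizing sketch (rule-of-thumb numbers, not vendor data): memory
        # needed just to hold a model during mixed-precision training with an
        # Adam-style optimizer, before activations and framework overhead.

        def training_memory_tb(params_billion, bytes_per_param=18):
            return params_billion * 1e9 * bytes_per_param / 1e12

        for size in (7, 70, 180):   # illustrative model sizes, in billions
            print(f"{size:>4}B parameters -> ~{training_memory_tb(size):.1f} TB "
                  f"before activations")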

    ERC Fraud

    Posted on by: Axel Kloth

    I have a hunch that the next big revelation is going to be that there is a lot of fraudulent Employee Retention Credit activity going on. I keep receiving emails and calls alerting me to the availability of the ERC for my company. One of those service providers even went so far as to say that according to the BBB (since when does the BBB have the authority and mandate to track that?), we had two employees during the ERC time frame at a salary level that would qualify for the full $26K of ERC per employee, and that he had prepared it all and I just needed to sign the document. He'd send it off and then, upon the refund, would collect 15% of the loot from us. There are two problems with that: my company did not have employees during that period, and I am not sure what would happen if I signed, sent it off and received funds. It could very well be that I'd end up in prison for fraud, and that he'd be laughing from the outside, as I would have defrauded the government while the contract between him and me is merely a civil matter. Needless to say, I did not sign, and I blocked him on the phone, by email and otherwise.

    I am pretty sure that two or three years down the road we’ll see a lot of innocent but naive business owners fighting in court to stay out of prison, while the criminal ERC service providers laugh their butts off all the way to the bank.

    Breach of the MSI/Intel Firmware Signing Keys

    Posted on by: Axel Kloth

    A few weeks ago, on April 7th, 2023, news broke that someone had broken into MSI's servers, and that among the things they stole were the MSI Firmware signing keys. That is not quite correct. The keys they stole were the Intel Firmware signing keys, and that is an indication of a complete misunderstanding of security through asymmetric keys. Security can be achieved by using symmetric or asymmetric keys, and using asymmetric keys implies that there is one key to encrypt (or sign) something, and another key to decrypt (or verify) it. The encryption/decryption (or signing/verification) operation requires a matched key pair. One key is called the public key because it is public on purpose, so that everyone can decrypt (or verify) something that you have encrypted (or signed) to protect it against impostors. In other words, someone uses a private key to encrypt (or sign) a document or a piece of Firmware or Software so that no one other than the key holder can produce an item that can be decrypted (or verified) by everyone using the public key. If someone encrypts (or signs) something with the wrong private key, your valid public key will not be able to decrypt it or verify the validity of the signature. This is particularly important for something as fundamental and basic as the BIOS or UEFI Firmware for your computer. You want to make sure that you do not install some malicious Firmware that someone has messed with, and so you need to rely on the secrecy of the private key used to sign the Firmware. As the holder of a private key, you must make sure that this key never gets distributed and never leaves the house or room.
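
    For readers who want to see the mechanics, here is a minimal, self-contained Python illustration of the sign-and-verify flow, using the widely available cryptography package. The key pair and the firmware image are generated on the fly as placeholders; this is obviously not how Intel, MSI or any BIOS vendor packages firmware. It only demonstrates that signing requires the private key, and that a single flipped bit makes verification against the public key fail.

        # Minimal sign/verify illustration with the Python "cryptography"
        # package. Keys and image are placeholders generated on the fly.
        from cryptography.hazmat.primitives import hashes
        from cryptography.hazmat.primitives.asymmetric import rsa, padding
        from cryptography.exceptions import InvalidSignature

        private_key = rsa.generate_private_key(public_exponent=65537, key_size=3072)
        public_key = private_key.public_key()

        firmware = b"\x7fUEFI-image-placeholder" * 1000   # stand-in for a BIOS image
        signature = private_key.sign(firmware, padding.PKCS1v15(), hashes.SHA256())

        def verify(image, sig):
            try:
                public_key.verify(sig, image, padding.PKCS1v15(), hashes.SHA256())
                return "valid"
            except InvalidSignature:
                return "REJECTED"

        print("untouched image:", verify(firmware, signature))
        tampered = bytearray(firmware)
        tampered[100] ^= 0x01                              # flip a single bit
        print("tampered image: ", verify(bytes(tampered), signature))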

    Well, Intel got that wrong and distributed the private Firmware signing keys to every manufacturer of Intel-based computers. MSI was one of them. Predictably, they got breached. MSI confirms security breach following ransomware attack claims. This is about the worst security breach possible, aside from the SolarWinds debacle. Anyone in possession of this Firmware signing key can now write a malicious version of the BIOS or UEFI for any Intel-based computer, distribute it and be certain that the user installs it thinking that it is legitimate.

    In other words, this breach has legitimized the illegitimate. The Police have become the bankrobbers, and the bankrobbers are the Police. It is very difficult to get out of this conundrum. All users must now use a different way of authenticating a Firmware update, and once that is done, the decryption keys are hopefully changed such that after a successful BIOS or UEFI Firmware update, the old key pair is retired, and all subsequent Firmware updates can go back to normal. However, this is the best case outcome. In reality, a good portion of users do not update their Firmware routinely, and only do so if and when something does not work any more. In that case, they will not be able to install the new legitimate Firmware as the keys have changed, and they will be greeted with an “Invalid Firmware” message.

    It is going to be up to Intel and MSI to fix that. How much trust can we put into that? It is not that it could not have been predicted. In fact, I wrote patents that were intended to avoid exactly this situation, and in the intro, I pointed out the vulnerabilities of the current way to deal with unauthenticated Firmware.

    The big question arising from this debacle is of course whether a better Trusted Platform Module (TPM, or a virtual version, vTPM) or a Root of Trust (RoT) coprocessor or a smarter and more secure BMC could have prevented this. The answer to that question is an unequivocal No. The problem here is that the Firmware signing key was compromised, and all of the above measures rely on a valid and legitimate signing key. The only remedy would have been a very different way for a processor to boot from its secure and encrypted Flash that is not predicated on a Firmware signing key. That method is described in one of my patents, and we are building a many-core processor that implements it. In fact, the newest version we are implementing in our Server-on-a-Chip is even more secure, with additional safeguards for authenticity.

    Twitter and NPR in a spat

    Posted on by: Axel Kloth

    Elon Musk had incorrectly tagged NPR as "government funded media". As a result, NPR decided to leave Twitter. Link to a Politico article here: NPR leaves Twitter. While that in and of itself is pretty bad, Musk then threatened to reassign the handle @NPR to another company.

    The Hill expands on this here: Twitter threatens to reassign @NPR handle. I hope that NPR does not back down and in fact sues Twitter and Musk over this. Why? Because the social media companies have effectively established a thriving parallel system to the USPTO and WIPO for trademarks.

    A handle is comparable to a trademark, and as such it should be under the purview of the national and international patent and trademark agreements. NPR spent a lot of time and money to establish its brand, and that brand is NPR. I imagine that Elon Musk would not be happy if someone on Mastodon claimed @Tesla and @ElonMusk as theirs. Imagine the damage to the brand NPR if Musk reassigns it to the National Pumice Recyclers (I checked, they don't exist).

    Why is this important? The patent and trademark system exists to protect intellectual property, and a trademark and name and handle belong in this category. This should not be up to private social media companies. IP and handles that are comparable to trademarks should be handled by the organizations that were set up to protect them.

    The big CXL Conundrum

    Posted on by: Axel Kloth

    It is interesting to see that everyone agrees that both hyperscalers and supercomputers today rely on an outdated architecture for processors, accelerators and memory that does not seem to work well. It is even more interesting to see that the suggested solutions don't solve the problem, but create new ones or exacerbate old ones. One of these is CXL, the Compute Express Link. In short, CXL is a secondary protocol over the PCIe infrastructure. It is intended to allow memory - in most cases that will be DRAM - to be disaggregated from the server and its processors. PCIe is a high-latency infrastructure, and as such is not suited to memory attachment. The argument is that DRAM is expensive and should be a shared resource across servers and processors. On April 19, 2023, Micha Risling, the co-founder of UnifabriX, stated where he thinks CXL fits into the future of memory in the data center. The article CXL is Ready to Reshape the World’s Data Centers even mentions the latency problem, only to go on to ignore it entirely.

    If CXL is used to replace non-shared SATA Flash, then it can be made to work, as CXL still has lower latency than a SATA- or SAS-attached disk, even an SSD. The problem arises when that CXL-attached memory is shared. If it is shared, then a coherency mechanism must be present to ensure that shared data has not been invalidated by a prior write access coming from a different processor. To ensure coherency, mechanisms such as MESI, MOESI and directory-based approaches exist, but all of them rely on a lookup for validity first. In other words, before a data set is read, a read access to a directory or the MESI/MOESI bits for that data set is needed to check if it is still valid, or if it has become invalid due to a modification from another processor which had fetched that data set but had not yet had the time to write back the modified data. If the data is still valid, then a read access can be executed while locking that data set copy in the shared CXL-attached memory against other accesses from other processors. Obviously, the more processors (including the many cores in current processors) have access to this shared memory, the higher the percentage of time during which the data set is not accessible, invalid or locked. Since CXL is such a high-latency infrastructure, the metadata traffic and the lockout times due to the long round-trip times will be a significant portion of the memory access times, and the usefulness of CXL-attached memory will be greatly diminished. In other words, sharing memory over a high-latency infrastructure such as CXL does not solve the problem; it will instead create new ones. The problem is exacerbated even further if the memory is shared in an appliance that contains internal CXL switches.
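
    A toy latency model in Python makes the argument tangible. The numbers below are assumptions picked only to illustrate the ratio; actual latencies vary widely by platform, switch depth and load.

        # Toy model with assumed (illustrative) latencies: the effective time to
        # read a shared cache line when a coherency lookup must precede the
        # data fetch over a high-latency fabric.

        local_dram_ns = 90     # assumed local DRAM access
        fabric_rtt_ns = 400    # assumed round trip across the CXL/PCIe fabric and a switch
        directory_ns  = 50     # assumed directory/MESI state lookup at the home node

        local_read      = local_dram_ns
        shared_cxl_read = fabric_rtt_ns + directory_ns + fabric_rtt_ns
        #                 ^ validity lookup round trip  ^ the actual data round trip

        print(f"local DRAM read:           ~{local_read} ns")
        print(f"shared CXL read w/ lookup: ~{shared_cxl_read} ns "
              f"({shared_cxl_read / local_read:.0f}x slower, before any lock contention)")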

    In other words, CXL is another protocol on top of PCIe and as such has the same latency problems. CXL effectively can only carry non-coherent memory traffic.

    Why is latency such an issue? I am going to simplify the situation a bit, but in essence this is what happens if a CPU (or more precisely, the processor core and its L1/L2/L3 cache) cannot access memory contents that it needs to continue working: it needs to stall, switch tasks or go to sleep. In any of these cases, no work gets done, unless a task switch is possible with the current context saved to cache and the context of another thread retrieved from cache without using DRAM. All data fetches cost energy, but task switches by themselves do not execute user code. The CPU can only continue to execute user code if in fact a context switch is possible with valid data already present in one of its caches. The higher the latency to and from DRAM, the larger the caches have to be, and the more hierarchies of caches have to be present in a processor. Large caches with their TCAMs and all external inefficient I/O such as SSTL-2 are the biggest power hogs. In other words, very large, shared, contended and blocking DRAM accessible through a high-latency infrastructure such as PCIe and CXL enforces an ever-growing need for more caches.
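
    The cost of a single stall can be sketched the same way. Again, the clock speed, IPC and latencies below are assumed, illustrative values, not measurements of any particular part.

        # Simple sketch (assumed numbers): how many instructions a core could
        # have retired while it waits for one memory access - which is why deep
        # cache hierarchies and context switching exist in the first place.

        clock_ghz = 3.0     # assumed core clock
        ipc       = 2.0     # assumed sustained instructions per cycle

        for name, latency_ns in (("local DRAM", 90), ("CXL/PCIe-attached DRAM", 850)):
            lost = latency_ns * clock_ghz * ipc
            print(f"{name:>22}: ~{lost:,.0f} instructions' worth of time per miss")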

    Accelerators in HPC for Beginners

    Posted on by: Axel Kloth

    Whenever I get asked what HPC is, I need to find an analogy. The analogies I use most are as follows: The CEO of a company does not personally do any of the work that leads to that company's ultimate product. The CEO hires and directs people who create the product. He or she supervises the hiring, the training, the work, the workloads and the quality of all involved parties, to make sure that the product is built to the specification that the customer wants. Sometimes even that is too abstract. In those cases, I try a different approach. A conductor does not play the music himself. He or she hires all necessary pieces of an orchestra and directs them to play the musical piece. He or she simply hires and supervises the execution of the performance. In the same fashion, the programmable elements in a supercomputer direct the workloads to be executed by specialized accelerators. The accelerators are less flexible, sometimes not even programmable, but they are much faster at executing a task, use less energy, take up less space on a chip, and on top of that they are much more robust and usually not vulnerable to attacks from hackers.

    HPC at Crossroads

    Posted on by: Axel Kloth

    It seems as if there is a confusion around the future of HPC, GPGPUs, special-purpose accelerators for AI (mostly the ML training part) and Quantum Computing. I have written up a short summary on where the industry is going, and Startup City has published it so that readers can familiarize themselves with the concepts, the outlook and the technologies needed. The article High-Performance Computing at a Crossroads hopefully clarifies some of the misconceptions. Abacus Semiconductor Corporation is working on processors, accelerators and smart multi-homed memories that can carry over digital bulk-CMOS technologies with improved system design over current processors until general-purpose Quantum Computers are available and affordable to solve the computational challenges of the future. As a Venture Partner at Pegasus Tech Ventures it is my responsibility to look at startups in this field and evaluate if they can advance the state of the art.

    Malware so far in 2022

    Posted on by: Axel Kloth

    It seems as if there is no letting up on malware. While in prior years we saw phishing attempts and redirections to phishing sites coming from Russia, North Korea, the Chinese Academy of Military Sciences, plenty from India and a few each from South America and from Iran, this year seems to be dominated by Russia. Particularly active was root@validcapboxes8.pserver.ru using multiple aliases off the same server. These were mostly fake loan payoff notices, fake quarterly financial results and fake annual statements. All of these files were MS Office files with embedded macros, renamed to appear as PDFs. I stopped counting the expiration notices of my email account and its password, as all of them also came from Russia. Except for one yesterday, coming from Iran. Nothing from China or North Korea, India or South America this year so far. It is pretty annoying, and that they can’t be caught is somewhat disturbing.

    FMS in 2022 as an in-person Event

    Posted on by: Axel Kloth

    The Flash Memory Summit is back to being an in-person event for 2022. While I am not presenting or organizing a panel this year, I am still on the Conference Advisory Board. Check out FMS for 2022, its agenda and its CAB!

    Broadcom acquiring VMWare

    Posted on by: Axel Kloth

    Broadcom has announced that it is buying VMWare. Broadcom is a fabless semiconductor company that had a historic focus on communication ICs and particularly switch fabrics (the Tomahawk series in particular). It was acquired by Avago, which in turn was a spinoff of HP/Agilent Semi. Avago renamed itself Broadcom after the acquisition had completed. While switch fabrics are still part of its core business, Broadcom has tried to diversify itself in the past 5 years. It first acquired CA (formerly Computer Associates) and then Symantec’s enterprise division (the remaining consumer business is now NortonLifeLock). It is unclear to me where the synergies in these acquisitions are, and if there is any cross-pollination of technology between those units. The same holds true for VMWare. Broadcom does not make the server CPUs that power the hyperscalers, nor does it make smart NICs or DPUs (Data Processing Units) as they are called today. VMWare would benefit from server CPUs with virtualization hardware support, which Broadcom does not make, and it would benefit from smart NICs with support for IOMMU tasks and any hardware-assisted protocols between server CPU and NIC as well as NIC-to-NIC protocol offload. Those would be synergies that create value for customers – but Broadcom does neither. As such, I can only see a sales channel that Broadcom offers. The question then is whether VMWare needs a different sales channel.

    That acquisition of course leaves an upside for anyone who starts a virtualization company today.

    Scientists are cracking HIV

    Posted on by: Axel Kloth

    I keep being asked the question what supercomputers are good for. Further down in the blog, I had written up a list of applications that are typically deployed on supercomputers. The newest one that I found was that Supercomputing helps reveal weaknesses in HIV-1 virus, and usually finding weaknesses in any adversary leads to exploitation of that weakness, and ultimately the elimination of that adversary.

    Now I am waiting for the common cold, the flu, allergies such as hay fever and a whole bunch of other ailments to be looked at... I need to look up the numbers, but people calling in sick for these ailments costs the world economy a very large amount of money. If we can get rid of these things cheaply and quickly, that would save the world a lot of money that could be used more effectively somewhere else.

    More vulnerabilities

    Posted on by: Axel Kloth

    It seems to me that the implementation of crypto engines is not going well. Hackers can steal crypto keys on Intel, AMD CPUs via ‘Hertzbleed’ vulnerability. Certain functions should not be executed in a CPU core, and instead they should be done in dedicated hardware. That makes design and verification easier, and it does not require frequency scaling with all of its vulnerabilities. Any weakness can be exploited. The more complex a system is - and software-based systems are more complex than hardware-based systems - the more attack surfaces will be present.

    That is one of the many reasons that we do not execute any cryptographic functions in software, and we do not keep keys in easy reach of software.

    GSA Silicon Leadership Summit 2022

    Posted on by: Axel Kloth

    I have attended my first large-scale in-person event since the COVID-19 pandemic broke out. The GSA Silicon Leadership Summit on May 12th at the Santa Clara Convention Center was not only well-executed, but also well-attended. Its title - New Horizons - was befitting the event and global developments. My key takeaways were all positive. The semiconductor industry is going to continue to grow, and it will hit the $1T revenue mark some time in 2030 or a few years thereafter. That is a fantastic achievement given that only a few decades ago this industry did not exist, and the then-CEO of IBM anticipated that there may be a need for five computers worldwide. What a difference a predictable cost reduction makes on a market. Today, smart phones have more computational performance than those computers that the IBM CEO referred to. As usual the limitations are I/O, and it seems this time around we will see large-scale deployment of optical I/O directly out of processor packages a few years down the road. Novel memories are needed to support traditional processors using the von-Neumann architecture, and for non-von-Neumann machines, we seem to have a few new paradigms up our sleeves. We still must secure the Internet, and AI will be able to augment some Human Intelligence in an ethical fashion. The headwinds are the current trends towards de-globalization, trade restrictions and, at this point in time, supply chain issues. These headwinds can all be overcome, and I am very positive on the semiconductor industry in general.

    More on Intel Optane

    Posted on by: Axel Kloth

    In February I had found an article from Tom's Hardware pointing out the losses Intel had endured with Optane. My assumption was that Intel would continue selling the business version of Optane until they run out of stock since the JV between Micron and Intel for the production of Optane had been shut down, and the fabrication facility was closed and then sold by Micron. Intel has no other source for Optane memory, and as a result, when the stock is depleted, it's over: Intel has Optane chip hoard with no plans to develop tech.

    Intel's consumer-facing NAND and SSD business was sold to SK Hynix, which has since rebranded it as Solidigm. Apparently, it is doing quite well under its new ownership. Tom's Hardware reports that Solidigm Unveils D7 Series Data Center SSDs: Up to 15.36TB, 7100MB/s - that is an impressive number.

    More Firmware Attacks

    Posted on by: Axel Kloth

    There are more and more BIOS/UEFI attack vectors in the wild, and the problem is that those are not just theoretical in nature; they are being actively exploited. While the newest links seem to indicate that Dell is more affected than others, that is not the case. I am not bashing Dell here, as most of the BIOS and UEFI code is common amongst all PC manufacturers.

    I think that we have reached a point at which it is impossible to continue on as if nothing has happened. The traditional system architecture is flawed in so many ways that we need to rethink it. It is preventing the industry from achieving better (i.e. more linear) performance scale-out, from better integration of accelerators, and from vastly improved security. The old adage of "never change a running system" must finally be overcome.

    Will the Cloud eat HPC?

    Posted on by: Axel Kloth

    The Cloud has come a long way from its first days and inception as EC2. Most companies these days use Cloud computing and technologies in one way or another. Abstracting from processor Instruction Set Architectures (ISAs) and using containers to allow dynamic shifting of workloads have all been invented for and in the Cloud. Many things we take for granted today were impossible to do just a few years ago, and that is an incredible accomplishment. Today, the Cloud is still someone else's data center, as was the case 15 years ago, but new technologies have been introduced to make Cloud services more palatable. What has not changed much is the level of abstraction users get from the Cloud - in fact, if anything, the abstraction level went up, and there are more layers of abstraction (and therefore translation and compute) than ever before. While there are bare metal Cloud offerings, they do not provide the Cloud benefits, and the performance level is usually not what could be expected from an on-premise installation. There are attempts to solve this and bring HPC to the Cloud (or, seen from the other perspective, extend HPC to the Cloud). The Next Platform posits a very valid question: Will HPC Be Eaten By Hyperscalers And Clouds? We believe that this is not the case.

    I have explained in my blog post "What is HPC anyways?" what the differences between the Cloud and HPC are.

    While the Cloud and HPC systems are similar, they are not identical, both in terms of the hardware used and in the applications running on them. In short, in typical Cloud applications thousands of applications run on thousands of servers. In HPC, one application with an enormous data set runs on thousands or tens of thousands of servers.

    Abacus Semi admitted to NewChip cohort

    Posted on by: Axel Kloth

    We are proud to announce that we have been admitted to the March cohort of the prestigious NewChip Seed & Series A Accelerator program. This will help us raise awareness of the company and its products among a broader set of investors. Abacus Semiconductor Corporation is re-imagining HPC to remove the existing interconnect bottlenecks, resulting in a greater than one order of magnitude increase in application performance compared to what is possible with existing processors, accelerators and memory.

    Inclusion in the accelerator program will also support our goal of creating an ecosystem around us that will enable other companies to license our technology so that their accelerators - GPGPUs or special-purpose ASICs - can benefit from this improvement in interconnects.

    EU CHIPS Act update

    Posted on by: Axel Kloth

    PC Gamer reports that Europe sets sights on global semiconductor domination. I had mentioned that the 11 Billion Euros that I read about in the article are by far not enough, and I posted that here. It turns out that I either did not read far enough through the entire legislation or the additional grants were hidden somewhere else, but very clearly the EU knows that 11 Billion does not pay for much. The real number appears to be closer to 70 Billion Euros, as PC Gamer states: "To this end the act also includes the potential to invest €30 billion in building fabrication centres by 2030. This puts the total spend at around $70 billion USD over the next ten years." That is a different ballgame and should get Europe back into semiconductor manufacturing. I am glad that this is the case.

    I believe that Europe will benefit from this agreement and money that is invested into the design and manufacturing of semiconductors.

    Firmware Attacks

    Posted on by: Axel Kloth

    Current defenses against malware attacks usually rely on software running on an x86-64 machine. This software can be hosted in a firewall, in a dedicated server to detect endpoint-compromising attacks, inside a mail server to detect spam and phishing attacks, and in a wide variety of other devices, including end user client devices. In nearly all cases, the integrity of the BIOS/UEFI and oftentimes the Operating System is inherently assumed. That assumption is dangerous since it's wrong. The Operating System can be compromised easily - no Operating System I know of is impenetrable. What is worse is the assumption that the BIOS or UEFI cannot be compromised. In every server, there is a BMC, and the BMC can update the host firmware. In other words, any attacker can circumvent Operating System provisions to protect the firmware by taking a short detour through the BMC. In most cases, this goes undetected, and the BIOS/UEFI modifications are persistent and even survive an Operating System reinstall.

    I keep hearing that those attacks are hypothetical only, that there are not many of them out there in the wild, and that even if there are, they have no relevance. Wrong. Here is a short sample of what has been published lately.

    Hardware must be used to avoid this, and the BMC route to updating the host's firmware must be made vastly more secure, ideally by using better credentials than just a username and password. The technology to achieve this exists, and we have it. We call it Assured Firmware Integrity or AFI™, Resilient Secure Boot or RSB™ and Protected Shadow ROM or PSR™.

    Intel Optane

    Posted on by: Axel Kloth

    I just stumbled across this article about Intel's Optane at Tom's Hardware. They found out that Intel's Optane Business Haemorrhaged Over Half a Billion Dollars in 2020.

    That is a lot of money, particularly taking into account that this started out as a Joint Venture between Intel and Micron in 2015, with high hopes to close the power and density gap between DRAM and NAND Flash. In 2021, Micron called it quits and bailed out of the JV. As far as I am aware, Micron let go all of the developers who did not want to transition to the DRAM or NAND Flash groups within Micron. Micron also sold the 3D XP fab to TI, and I was under the impression that there was no supply agreement between TI and Intel. As a result, I thought that Intel had closed down all of its Optane business (Optane being the trademark Intel held for the 3D XP technology).

    It certainly did not help that 3DXP was Phase Change Memory, but Intel chose to deny that it was. More importantly, 3D XP never fulfilled the promise of DRAM-like performance at NAND-Flash density and power.

    Apparently, Intel continued to sell Optane-branded SSDs, but clearly at a loss, and with no upgrade path for the technology and the devices themselves. When the stock is depleted, I assume that Intel will simply shut down this brand and technology.

    nVidia/ARM deal off

    Posted on by: Axel Kloth

    The official confirmation that the nVidia/ARM deal is off came in. Ars Technica reports that the $66 billion deal for Nvidia to purchase Arm collapses.

    I am not surprised, and in fact, I think that this is a good thing. There will be some assurance that ARM will continue to be the Switzerland of processor and semiconductor IP, and the ISA will continue to be somewhat of a lingua franca. However, in the long term a good number of ARM licensees will look for alternatives to ARM as an outcome of this ordeal. I had voiced my concerns early on.

    I believe that RISC-V will substantially benefit from this.

    EU agrees on a CHIPS Act

    Posted on by: Axel Kloth

    The EU is investing 11 Billion Euros into the semiconductor industry. Intel is investing $20B in the next few years, and TSMC is pledging $100B in the next few years. The US has a $55B CHIPS act, and it remains to be seen how much Korea, Japan and China are going to put up. The EU pledge is simply not enough to make a difference. It can be found here: Digital sovereignty: Commission proposes Chips Act to confront semiconductor shortages and strengthen Europe's technological leadership. It also comes right on the heels of the announcement that Margrethe Vestager, the EU’s Commissioner for Competition, declared that "achieving semiconductor independency is ‘not doable’".

    Intel's Strategy on its competition

    Posted on by: Axel Kloth

    I think that by now I understand Intel's strategy with regards to ARM, nVidia and RISC-V. They all tie in together and must be seen as a whole.

    Intel has understood for quite a while that they have a formidable competitor in ARM and in nVidia. The ensuing steps were brilliant from a business strategy perspective. It is in fact the old "divide and conquer" method.

    Intel invested in SiFive as the first of the RISC-V commercialization plays to make sure that ARM's growth can be stunted. Intel had understood quite a while ago that it had lost its ability to compete with ARM in the cell phone, smart phone, tablet and low-end laptop markets. Intel had to have a way to stop ARM from dominating the industrial IOT (IIOT) market, and RISC-V is a perfect means to achieve that. RISC-V, as architected and first implemented as Rocket Chip, had all of the necessary ingredients to prevent ARM from completely dominating the IIOT market, where Intel's x86-64 had no chance of competing. Investing in SiFive offered a way to steer SiFive into the IIOT market and help build out the ecosystem around RISC-V, particularly focusing on tool and IP development for the embedded and IIOT markets. This ensured that ARM would have a viable competitor in the IIOT market without affecting Intel's cash cow, the data center market.

    That means that Intel could focus investments in fabs and in the data center market. ARM would be taken care of by RISC-V in ARM's native domain. In the data center market, Intel would only have to compete with AMD and with nVidia. Investment in RISC-V would also give Intel a strategy in case x86-64 really started being unable to compete technically. It was a cheap insurance against any surprises.

    If nVidia and more specifically CUDA were to become too much of an economic problem for Intel, a simple way to cut that off would be to invest in special-purpose accelerators, such as Cerebras, GraphCore or any others (including us) for evolving new requirements, and to develop CUDA compatibility for all of those accelerators.

    The termination of the nVidia and ARM merger is a boon to Intel as it forces current ARM licensees to rethink their strategy, and nVidia will likely return to the RISC-V table.

    EU is giving up on becoming independent

    Posted on by: Axel Kloth

    The EU had grand plans to become independent of the US, the UK, China and anyone else in the design and manufacture of semiconductors. That goal alone highlights a complete misunderstanding of how semiconductor design and manufacturing works. First of all, tools are needed to design the products. Those tools are non-trivial to create and to use. Students have to be educated in their use, and proficiency must be achieved. Then, after the design phase, logical correctness must be established, with yet another set of tools. These tools again are non-trivial to write and to use. After the verification of the logical correctness of the design, the physical design phase starts, and that is non-trivial as well. These tools are specific to the manufacturing process, so they cannot be designed in a vacuum. This is a collaboration between the manufacturing plant or "fab" and the physical design tool designer, creating a PDK (or Process Design Kit). Once the physical design is done and all components are placed, it has to be verified that the logic design is reflected in the physical design's implementation, and whether the targeted clock frequency can be achieved. This is the dynamic timing closure phase, which can unravel a lot of the physical design because unlike for mathematical (Boolean) logic, light and electrical impulses do not travel at infinite speeds. If a signal path is too long, parts of the design have to be relocated. This is an iterative process that can take weeks and hundreds to thousands of hours of CPU time on a large-scale computer cluster. It was clear to anyone inside the industry that the EU would not be able to achieve full independence.

    I had anticipated that the EU would declare that some manufacturing will have to be lured back to the EU member states, with incentives being paid and well-trained workers being made available through continued education paid for by the EU. I had also anticipated that the EU would declare a preferred CPU Instruction Set Architecture (ISA) that must be used for all military and crucial infrastructure projects. RISC-V would have been a perfect choice.

    I have been wrong, as I have been so many times when it comes to predicting actions that politicians take.

    Margrethe Vestager, the EU’s competition chief, declared complete and unconditional surrender. Achieving semiconductor independency is ‘not doable,’ EU competition chief says.

    No preferred ISA, no preferred High Level Language (HLL) for semiconductor design, no embrace of tools such as CHISEL, no luring back of semiconductor fabs to Europe, no on-the-job retraining of engineers to target semiconductor design. Not even trying to retain processor design talent so that at least the design of processors and accelerators remains a possibility in Europe. Nothing. I'd call that a complete and unconditional surrender.

    Entirely giving up semiconductor independence of course also means giving up on leading High Performance Compute (HPC). In other words, the EU will continue to use US or Chinese processors and accelerators and memory to power their next-generation Supercomputers. Who says that there are no back doors in there?

    Apple proves that the ISA is not relevant

    Posted on by: Axel Kloth

    I keep hearing that the Instruction Set Architecture (ISA) is important, and that without binary compatibility of our processors to the Industry Standard we have no market.

    That is a gross misunderstanding of how things work these days.

    First of all, Apple has changed processors and ISAs multiple times now. Which processors and ISAs did Apple use over time? Apple started out with the 68000 from Motorola. When that ran out of steam, Apple changed to POWER/PowerPC (IBM). IBM discontinued that product line, so Apple switched to x86-64 from Intel. When Apple saw that Intel promised better performance but at ever-increasing levels of power consumption, Apple had to find a new way to improve performance on the same trajectory as Intel promised with x86-64, but with a decreased level of power consumption. That required a different ISA and a different manufacturing process, and so Apple switched to ARM processors that were designed in-house and fabricated at TSMC. These developments became the A and M series processors for the iPhone and the Macs.

    None of these processors share an ISA.

    Every single time Apple changed ISAs, there was an outcry from people who did not know any better that such a switch would be devastating, and that it could not possibly work. Every single time it went without large hiccups, largely due to the fact that the Operating System (OS) is not written in the processors' assembly language, but in a higher-level language, typically in C. That means that code rewrite is limited to very small portions of the OS. With ever-better compilers such as LLVM/CLANG, recompiling the rest of the OS becomes a fairly manageable task.

    Four ISAs over time. No substantial problems.

    In other words, the ISA has become less relevant.

    Another change that took place is the web, or more specifically, the xAMP stack. The xAMP stack is a software stack comprising an Operating System (usually Linux for the LAMP stack, FreeBSD for the FAMP stack, and Windows for the WAMP stack), Apache as the web frontend, mySQL as the database, and PHP/Perl/Python as the scripting language. The Internet is built on and predicated on the xAMP stack. Nearly everything in the "backend" of the Internet runs on top of an xAMP stack. How much?

    According to Pronskiy, PHP runs "78 per cent of the Web," though the figure is misleading bearing in mind that this is partly thanks to the huge popularity of WordPress, as well as Drupal and other PHP-based content management systems. PHP is some way down the list of most popular programming languages, 11th on the most recent StackOverflow list, and sixth on the latest GitHub survey, down two places from 2019.

    If PHP runs 78% of the web, then by the very definition of the stack it must run on 78% of all web servers. So 78% of all web servers run the xAMP stack.

    In other words, if we have LAMP, FAMP or WAMP running on any given server with any CPU that supports this stack, we cover 78% of all servers and/or traffic.

    Linux and FreeBSD run on RISC-V as of today. Apache has been ported, mySQL has been available for quite a while, and PHP, even with its JIT, reportedly runs as well. In other words, LAMP and FAMP run on RISC-V as of today.

    While work is underway to port better databases than mySQL (such as ScyllaDB, Cassandra, PostgreSQL or KeyDB), they are not necessary for RISC-V to run web applications. RISC-V is not a niche product for which drivers are hard to come by. Drivers can be an issue for an RTOS in an embedded device, but today RISC-V is already running Linux and FreeBSD with all necessary drivers.

    What is HPC anyways?

    Posted on by: Axel Kloth

    I keep getting questions about HPC and Supercomputers that make me think that the industry has not done a great job in explaining what HPC is, why Supercomputers are needed, and why a Supercomputer is not just a hyperscaler's data center.

    Let me first answer what HPC is. HPC describes a segment of compute that deals with very large-scale problems. Weather forecasting with good accuracy over more than 5 days is still an HPC problem. The number of input parameters and the size of the volume elements largely determine the computational effort and the accuracy of the result, particularly if the result has to be precise enough to take action 5 days out or more. Climate modeling is another application for HPC. Any large-scale Finite Element Method (FEM) being used to statically or dynamically simulate the behavior of a system falls in the category of HPC. Crash tests for cars under development that simulate the behavior of a car in an accident are HPC. Studying how to contain the plasma in a nuclear fusion reactor certainly qualifies as HPC. Understanding how the coronavirus interacts with cells is an HPC application.

    HPC applications are usually executed on one or more Supercomputers. Why? If a large-scale problem must be solved, we can either use a very large computer cluster with many CPU and accelerator cores to solve the problem in reasonable time, or, if we have too much time on our hands, we can use a small cluster of servers and wait for weeks or months or even years for the result.

    So then why don't we use "The Cloud" as a supercomputer to solve HPC problems? First of all, "The Cloud" really is someone else's data center - in most cases, it will be Google's, Amazon's or Microsoft's servers in a data center with tens, if not hundreds, of thousands of servers. In "The Cloud", tens of thousands of applications for tens of thousands of users run on tens of thousands of servers. The individual workloads are small, and so is the need for disk or network I/O per user or per application. There is very little need for those applications to communicate with other users' applications. As a result, there is not much need for low-latency communication between servers. That led to the development of containers and Kubernetes, which add a higher level of abstraction and allow workloads to be moved from one server to another in case of failure, or if a server starts being overloaded and response times increase. In other words, in "The Cloud" the server, its CPU and its memory as well as local disks are the performance-dominating parts. The interconnect plays a very small role.

    In a Supercomputer, the processor, accelerators, memory and local disk are important, but since we know that the computational problem is very large and requires thousands of servers to work in concert, the interconnect plays a crucial role. Imagine an employee sitting at a desk, sharing the work with a colleague across the desk. If there is any question, the employee can immediately ask the colleague and clarify whatever needs to be explained. That is a low-latency, high-bandwidth interconnect. Now imagine the colleague is on a different floor. The employee has to get up, go to the other floor, find the colleague and ask the question. We may still have the same bandwidth of communication between the two employees, but the latency has increased. That has an immediate impact on the granularity of the tasks that can be shared. Getting up, going to another floor and finding the colleague takes enough time to reassess whether it is worth the effort, or whether one should try to solve the problem by oneself. Only if the problem is large enough to justify the time lost finding the colleague would the employee try to farm out the problem. It is exactly the same in a Supercomputer. The higher the latency between two processors, or between processor and accelerator, or to their memory, the larger a task has to be for farming it out to make sense. The aggravating factor is that high latency even impacts metadata, such as the simple question of whether the other processor is busy or whether it can take on the task in the first place.

    In other words, in a Supercomputer the interconnect plays a crucial role because the workloads differ from those in a hyperscaler's data center. That is why Supercomputers or HPC as a Service will have to wait until there is a unified processor, accelerator and memory architecture that can serve both purposes at a cost comparable to today's industry-standard machines.

    No more VPNs needed?

    Posted on by: Axel Kloth

    Tom's Guide reported that Security experts say you no longer need a VPN — here's why.

    When I read that, I have to say I was perplexed. I was surprised not only by the author, but also by the security experts. The argument that the security experts made was that all traffic is encrypted anyway, and therefore the need to use a VPN (which encrypts traffic and provides you with a trusted DNS from your VPN provider, your home or your own DNS Server) is not there any more.

    Needless to say, I fundamentally disagree. Your traffic is only protected if you exclusively access sites that use SSL/TLS, signified by the lock in the address bar and a URL that starts with https://. However, if you handle email outside of a web mail client, use FTP, or use any protocol other than secure HTTP, that traffic is in cleartext. WLAN snooping allows anyone to listen in or, worse, insert himself or herself as a man in the middle in a MITM attack. That is simply not acceptable, because a lot of the work we do remotely is non-https traffic. Leaving all of that in cleartext is dangerous.

    It seems like they did not quite trust their own advice either, since there was a qualifier at the end of the article under "How to protect yourself without a VPN", and then this gem was included: "Set up a private VPN server on your high-end or gaming router, or "flash" a cheap router with free firmware like DD-WRT or Tomato, so laptops and mobile devices can use your secure home broadband connection while out of the house". So you protect yourself with a VPN while being told that VPNs are generally useless these days. This is not only circular logic, it is plainly illogical and bad advice.

    My advice: ignore Tom's Guide's advice and use a VPN server in your house or business, and then install the VPN client on your phone and laptop, and you will be safe anywhere. VPN servers are cheap and easy to configure these days. Check out my FOSS recommendations here.

    nVidia's takeover of ARM in trouble

    Posted on by: Axel Kloth

    When nVidia proposed to acquire ARM, I had my doubts on multiple levels. ARM China is a hot mess, and there is no resolution in sight. As an nVidia subsidiary, the licensing terms and conditions for anything from ARM (not only processor cores) would have changed, and that would have put startups in a bad position if they banked their existence on ARM IP. It likely would have also impacted the large licensees such as Apple, Qualcomm, Samsung and many others. Now, it seems the deal is in trouble, and according to Bloomberg, Nvidia Quietly Prepares to Abandon $40 Billion Arm Bid.

    I see that as a net positive. If that acquisition fails, nVidia will return to the RISC-V table, and it will help grow that ecosystem. That means that we are left with x86-64 for servers and desktops, and ARM for smart phones, feature phones, tablets, some server processors and possibly the laptop market. It also helps RISC-V, as ARM won't be as dominant as it is now. I can foresee RISC-V in the edge compute market, in laptops (provided that someone builds yet another beautiful and easy-to-use GUI on top of FreeBSD), and in scaled-out HPC, which is what we are doing.

    In other words, we won't see ARM take over as the next Intel when it comes to ISA (Instruction Set Architecture) monopolies. That's a good thing, despite the fact that ISAs today do not carry the same importance that they did 20 years ago.

    Here are the links to my blog entries highlighting the problems I saw: FTC opens probe into nVidia and ARM merger, nVidia and ARM merger hits roadbumps, Apple and ARM and nVidia buying ARM.

    A flat tire is not a software problem

    Posted on by: Axel Kloth

    I hear more and more often that hardware does not matter. Software will solve all of the world's problems, and the AI generation modified that to "AI will solve the world's problems". Well, no. Plain and simple: this is wrong. The Instruction Set Architecture might not matter as much. But hardware matters.

    Software and even AI (which of course uses some software and lots of APIs) runs on hardware of some type - be it a general-purpose CPU, a special-purpose processor or coprocessor, or an accelerator. With general-purpose CPUs, GPGPUs and most accelerators built on premises from 20 years ago, we need to re-evaluate that hardware. That includes the CPU, all of main memory and mass storage, accelerators, interconnects and how we deal with I/O, DMA and Interrupt Requests.

    A flat tire is not a software problem. You can spend dozens of hours hacking the car's tire pressure monitoring system (TPMS) to allow it to continue to drive, but in the end you will ruin the tire, the rim and eventually bottom out on the brake disk, then the wheel hub, and then the frame or unibody frame members of the vehicle and ultimately destroy them in that order.

    A flat tire is a hardware problem that needs to be fixed.

    I just watched a YouTube video explaining the radix sort. We all know that Google and YouTube know everything that can be known in this universe, but I had a healthy laugh. Why? Because the premise is to not compare and instead create new lists in DRAM, read from those lists, and even use pointers in DRAM to list elements in DRAM. While radix sort does require vastly fewer operations than quicksort, bubble sort and other comparison-based algorithms, guess what the slowest thing is you can do in today's computers? If you guessed DRAM reads and writes, you'd be correct. One of the fastest operations a CPU executes? If you guessed a compare, you'd again be right. So... very clearly the software developers have not talked to the processor designers in 10 years. Or maybe 20. It is time to fix that. CMP or BNE are fast. DRAM reads and writes are not.
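
    To make that concrete, here is a minimal sketch of one pass of an LSD radix sort (my own illustrative code, not from the video): every element gets read from one DRAM-resident array and written to a scattered position in a second one, whereas the per-element work of a comparison sort is a CMP and a branch that stay in registers.

        #include <stdint.h>
        #include <stdio.h>

        /* One counting-sort pass of an LSD radix sort, bucketing by one byte. */
        static void radix_pass(const uint32_t *src, uint32_t *dst, size_t n, int shift)
        {
            size_t count[256] = {0};

            /* Histogram: a full sweep over the array in DRAM. */
            for (size_t i = 0; i < n; i++)
                count[(src[i] >> shift) & 0xFF]++;

            /* Prefix sums give each bucket its start offset in dst. */
            size_t offset[256], sum = 0;
            for (int b = 0; b < 256; b++) {
                offset[b] = sum;
                sum += count[b];
            }

            /* Scatter: another full sweep, with writes to scattered DRAM
             * locations; this is the traffic the operation count does not show. */
            for (size_t i = 0; i < n; i++)
                dst[offset[(src[i] >> shift) & 0xFF]++] = src[i];
        }

        int main(void)
        {
            uint32_t a[] = { 0x30, 0x10, 0x20, 0x10 }, b[4];
            radix_pass(a, b, 4, 0);
            for (int i = 0; i < 4; i++)
                printf("%#x ", b[i]);
            printf("\n");
            return 0;
        }

    Sorting full 32-bit keys this way takes four such passes, i.e. eight sweeps over memory, which is why the operation count alone does not predict wall-clock time once the data no longer fits in cache.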

    Democratizing Chip Design

    Posted on by: Axel Kloth

    Mike Wishart and Lucio Lanza explain in EETimes why chip design is experiencing a renaissance. They claim that The Democratization of Chip Design leads to many new entrants into the IC and processor design spaces. To a degree that is correct, as the barrier to entry is lowered by new languages that are vastly easier to understand. For example, CHISEL and Scala are used to generate a RISC-V processor with all of its peripherals, and no Verilog or VHDL is required to write a RocketChip RISC-V processor. However, Verilog is created out of the generator languages, and that output needs to be understood, modified and integrated into the rest of the design. While I agree with Mike and Lucio that this in fact makes a whole lot of things a whole lot easier, I am not sure we will see many more entrants into the realm of processor design. We have been using CHISEL, Scala and RISC-V since 2012. During that time, we did not experience many newcomers. What I can envision is that many more companies will sprout up that develop microcontrollers for many more special-purpose applications. Why? With the design of the processor core out of the way, all kinds of accelerators, peripherals and I/O ports can be designed with relative ease, and that is what microcontrollers are: a processor core with just enough performance and an industry-standard Instruction Set Architecture, and lots of other IP around it. That IP can be written in Verilog, in VHDL, in Scala/CHISEL or in any other language that is fit for the purpose.

    The Significance of the xAMP Stack

    Posted on by: Axel Kloth

    The Internet has been around for the average user for more than 20 years now.

    A good portion of its success was that it established a homogeneous platform on the server-side backend upon which everyone could build additional applications. This platform is called LAMP, an acronym for Linux, Apache, mySQL and PHP. While alternatives for each component exist (there are alternatives to the underlying Linux OS such as Windows and FreeBSD as well), Linux prevails and is the most-often used OS in the backend. Apache has had a few competing solutions that focus on better performance or scalability, such as nginx. The same is true for mySQL - more modern databases, including in-memory databases, have sprung up and can replace mySQL in the LAMP stack. PHP (and Perl) form the foundation of the software running on top of LAMP. Both are interpreted (scripting) languages. In other words, for as long as applications rely on structured query language commands compatible with mySQL, use the same protocol as Apache, and can execute PHP (or Perl) commands, the vast majority of the applications of the Internet stack will work without modification, and since everything on top of the LAMP stack relies on interpreted languages, they are ISA-independent. As a result, Internet applications will work on x86-64 processors, on ARM or MIPS processors, and on RISC-V processors without recompiling. The processors' Instruction Set Architecture (ISA) has become less relevant with the lingua franca of the Internet.

    Recompiling is only needed for the basic applications that make up the LAMP (or WAMP or FAMP or xAMP) stack. In essence, that means that applications that make up the Internet backend will work on any processor once the xAMP stack has been made available by compiling it for this processor architecture or ISA. Any new processor ISA that is supposed to be used in Internet backend applications therefore only needs to provide an Operating System, Apache or a compatible web server, mySQL or a compatible database, and a PHP interpreter. With those few components, which can easily be created from the open source repositories with an appropriate compiler, an xAMP stack can be provided such that this processor architecture can be deployed in servers on the Internet backend.

    Modern compilers such as LLVM/CLANG can even be used to allow a processor to execute a different processor's ISA. In other words, Apple's M1 Pro in conjunction with LLVM/CLANG can execute x86-64 code, which may be necessary if a native application is not yet available. Depending on the quality of the emulation and its hardware support, a processor might be able to execute a different processor's ISA in near real time.

    Understanding the Human Brain needs HPC

    Posted on by: Axel Kloth

    I keep hearing that HPC is too abstract and no one understands why we need it. I am not quite sure how to answer this. There are so many applications for HPC that go unnoticed that I can understand why they are not on top of the mind of the layperson, but in reality they have an impact on everyone's life. Weather and climate forecasts, crash tests, computational fluid dynamics and most bio-engineering research are HPC applications. The human brain is another one, so if you can't wrap your brain around it, then it is because it is being simulated for researchers to understand it better!

    An excellent overview of what that kind of research does and what it can accomplish is summarized here at Human Brain Project: Researchers outline how brain research makes new demands on supercomputing.

    Pat meets resistance at Intel's reorg

    Posted on by: Axel Kloth

    It looks like Pat Gelsinger is doing all the right things at Intel - and predictably runs into resistance, both internal and external. I bet a lot of analysts don't like the new strategy, and unfortunately, too many investors listen to analysts. In my opinion, analysts are Monday morning quarterbacks. They have no insight and no responsibilities, take no risk, but feel free to criticize after the fact. They are wrong more often than those who have to run a business and make decisions.

    Among others, Pat has reorganized the HPC group (Intel Confirms Damkroger Out as Head of HPC; McVeigh to Lead Newly Formed Super Computer Group) after splitting it up into two: Intel Reorgs HPC Group, Creates Two ‘Super Compute’ Groups.

    Cooling Technologies for Data Centers and HPC

    Posted on by: Axel Kloth

    I have tried to figure out the power consumption of the totality of today's data centers, supercomputers, the Internet backend, the Internet itself with its Points of Presence, the last-mile providers such as ComCast and AT&T and their international equivalents, and the numbers I have found indicate a staggering degree of uncertainty.

    The numbers I found vary between roughly 1% and close to 7% of the worldwide generation of electricity, and that is not even including the power consumed by the miners of bitcoin and the like. There are not quite 200 countries on this planet. If it is 1% then this would mean that the power consumption fueled by our digital habits is higher than the power consumption of a good number of entire countries. If it is approaching 7%, then that would make it so large that entire clusters of countries (such as for example all of Northern Africa) use less power than what's needed to run our digital economy. While I am as guilty as anyone else of contributing to this, it occurs to me that we must do something to cut that power consumption back. A good portion of the power needed for data centers revolves around cooling, and to me, that seems like it is the easiest part to address quickly.

    Today, data centers are cooled via horizontal movement of air through servers and top-of-the-rack switches, and subsequent vertical movement of that hot air to the ceiling, where it is then extracted and cooled. That's the dumbest way of cooling, as air does not transport heat well, and heated air has a lower density than cold air and as such wants to rise. At the very least, data centers should adopt the former TelCo rack standards with no horizontal boards allowed, letting air rise vertically through the rack from the raised floor providing cool air to the ceiling where heat is removed. Ideally though, the industry would convert to liquid cooling. As The Next Platform mentions, Liquid-Cooled Systems Are Inevitable, But Not Necessarily Profitable. The industry seems to be so unwilling to change that even some established players retreat.
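
    A quick back-of-the-envelope comparison shows why air is such a poor transport medium for heat. The numbers below are approximate textbook values that I am assuming for illustration (air at roughly 1.2 kg/m^3 and 1005 J/(kg*K), water at roughly 1000 kg/m^3 and 4186 J/(kg*K)):

        #include <stdio.h>

        int main(void)
        {
            /* Approximate volumetric heat capacities, in J/(m^3*K). */
            double air   = 1.2    * 1005.0;
            double water = 1000.0 * 4186.0;

            printf("air:   %.0f J/(m^3*K)\n", air);
            printf("water: %.0f J/(m^3*K)\n", water);
            printf("ratio: ~%.0fx\n", water / air);  /* roughly 3500x */
            return 0;
        }

    Per unit volume and per degree of temperature difference, water carries on the order of three and a half thousand times more heat than air, which is the physics behind direct-to-chip and immersion cooling.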

    HPCWire reported that Asetek Announces It Is Exiting HPC to Protect Future Profitability, stepping away from liquid cooling systems for HPC and refocusing on consumers and data centers only. According to HPCWire, "Asetek has been a mainstay provider of warm water, direct-to-chip liquid cooling technology in use at HPC sites worldwide, partnering with companies such as Cray, HPE, Fujitsu, Supermicro, and Penguin Computing." While allegedly CoolIT has taken up that slack and established companies such as Clustered Systems compete for the existing markets, new entrants such as Ferveret try to convince the industry that immersion cooling is the way to go. Certainly Ferveret will have to weigh the advantages of immersion cooling against the drawbacks of having to harmonize server mainboards, and against competing technologies.

    Data Security

    Posted on by: Axel Kloth

    Securing everyone's data is going to be a herculean effort for a variety of reasons. First, the industry has not really put a focus on data security, despite claiming otherwise. Second, there is a fairly fundamental misunderstanding of what data security is. Third, the attackers are getting better and, as far as I can tell, learn faster and adopt new strategies more rapidly than those who aim to keep the data secure and private.

    There are multiple reasons for all of the above. First of all, data is growing exponentially, and that poses a problem. The number of servers in data centers simply requires automated OAM&P, and as such, tools that help administer hundreds of thousands of servers with one set of credentials for a super-admin. Second, computers and operating systems have reached a level of complexity that makes it nearly impossible to make them impenetrable. As a result, it is always going to be easier to find holes and exploit them than it is to write watertight software on top of un-breachable hardware. Third, the potential payouts and the number of attackers - including nation states - keep growing.

    We are also witnessing that the focus - if data security is discussed at all - is on either data at rest or data in transit. That is akin to securing the vault in the bank and securing the transport vehicle for the cash and other monetary instruments, but not looking at how to keep the drivers of the armored vehicles and the bank tellers safe. In other words, we try to keep data safe when stored and transported, but not the devices that enable and provide that protection. Breaches are very prevalent and will only become more frequent and more devastating once 5G, with its curbside and physically unprotected servers, is ubiquitous. I'll call this Kloth's Fourth Observation: The devices that are intended to keep data at rest and data in transit secure are unprotected and must be secured.

    So far, the industry has simply ignored the problem of computers and networks being breached. While SANS and MITRE keep a database of vulnerabilities and exploits, we do not see that the manufacturers of operating systems, firmware and even hardware (computers, memory, processors and ancillary chips) have fundamentally re-thought how to protect the computer itself. The thought that encrypting data protects it is simply too short-sighted. Even if uncrackable encryption is used to protect data at rest, it is not secure, because once the computer has access to said encrypted data, it must decrypt it to use it. Any kernel process in the OS running concurrently with the legitimate process that deals with the now-decrypted data can access it with little difficulty. Unless the keys are unknown to the operating system, such a process might even be able to steal the keys, making it even harder to defend against threats.

    The military has known strategies to protect theaters for a long time, and the main mantra is that before you can protect others, you have to be able to protect yourself. Accordingly, we must make sure that computers can protect themselves so that only authenticated processes run, for authenticated users, and at times that are defined in service level agreements.

    We have developed technology that enables the processor to defend itself against attacks so that it then can protect data at rest and data in transit. This technology is built into our Server-on-a-Chip and into our HRAM.

    Legal use of encryption at risk again?

    Posted on by: Axel Kloth

    It seems like every single time an FBI Chief is under duress for not doing his/her work, they deflect by insisting that encryption should be outlawed.

    Following tradition, Chris Wray is now doing the same. FBI Director: Ban Encryption to Counter Domestic Extremism. Under pressure for not doing his job with regard to the vetting of Brett Kavanaugh in preparation for the Senate Judiciary Committee hearing, and for dropping the ball again on the Larry Nassar case, he brings up outlawing encryption. Like before, this is ridiculous, and this man does not know what he is talking about, let alone what he should be doing. There’s No Good Reason FBI Director Chris Wray Still Has a Job. If encryption is outlawed, regular citizens can't protect themselves, but criminals will not be deterred by civil or criminal penalties for using it. Simple logic dictates that. If a criminal expects that, if caught and convicted, he or she will face 30 years in federal prison, then what would a few thousand dollars in penalties and fines or 3 months in local prison do as a deterrent? Exactly nothing.

    I had alluded to this in my old SSRLabs blog on 2020-10-12 under the title "US DoJ on Encryption — again", copied verbatim:

    Yet again the US Department of Justice (DoJ) tries to pitch End-To-End Encryption against Public Safety. The reality is that the opposite is true. There is no Public Safety without End-To-End Encryption. Predictably, the DoJ brings up exploitation of children to justify restricting the use of encryption. Encryption relies on secret keys or key pairs. The algorithms are standardized. For backdoors to work, a repository of keys and key pairs has to be created. This database will be the most-targeted piece of property ever, as it would reveal all keys from everyone to everyone else using encrypted communication. Whether this database is a collection of databases by each provider or a centrally and federally managed database does not make a difference. It will be breached. I do not want to go into any more detail here, and anyone who wants to dive deeper is invited to ping me. I promise to return email requests. I'd like to make it very clear: Backdoors to encryption are not needed and are dangerous. This renewed attempt of pushing legislation through that restricts encryption must be stopped.

    and here on 2015-07-16 under the title "Encryption at Risk?", again copied verbatim:

    I am not quite sure what to think of the recent statements that the director of the Federal Bureau of Investigation (FBI), James Comey, has made. According to The Guardian, James Comey, FBI chief wants 'backdoor access' to encrypted communications to fight Isis. To me it looks like he is looking for a justification to first ban and later on outlaw strong encryption without backdoors. This is confirmed reading the statement right from the horses' mouth here: Going Dark: Are Technology, Privacy, and Public Safety on a Collision Course?. Newsweek confirms this interpretation here: FBI's Comey Calls for Making Impenetrable Devices Unlawful. Well, I am not a fan of backdoors. I think that encryption is good and backdoors are bad. The reason for that is very simple. Strong encryption protects you and your privacy. You do not send a piece of important information on the back of a postcard - you put it into an envelope. You do not hand this envelope to Shady Tree Mail Delivery Brothers to get it to the recipient. You drop it into a mailbox of the USPS, Fedex, UPS, DHL or the like, expecting that they do not open the envelope. With the delivery contract, you have a reasonable expectation of privacy. On the Internet, there is no expectation of privacy. If you want something to be delivered such that no one in the path of the transmission from you to the recipient can read the contents, then you need to be able and have the right to use strong encryption to ensure that despite the open nature of the Internet no one can snoop. It also should be up to you to determine what is worthy of protection and what not. If I send an email to a supplier asking if they would like to do business with me, then I do not need any encryption. However, if they agree and they send me back a quote, they sure do not want their competitors to be able to intercept and evaluate their quote and possibly undercut that quote. They have a reasonable interest in protecting their quote. Now let's assume that we have a new law in place that allows strong encryption but requires you to accept a backdoor into your encryption with the backdoor keys being held at a government location. Why is that a bad idea? Well, for starters, the biggest focus of any hacker will be this repository of keys to the backdoors. Any hacker on the planet - good or bad, capable or incapable, ethical or not - will attack this repository. Brute force attacks and social engineering and many other attack methods or simply sheer luck will be used to get in. It is unrealistic to assume that such database can be protected, and it is naive to pretend that a mechanism providing a backdoor cannot be exploited. If history has proven anything then we must assume that encryption with a backdoor is useless as both the backdoor mechanism itself and the centralized repository for the backdoor keys are vulnerable and will be cracked. We know that the likelihood to break into the repository of keys for the backdoors is 100%, no matter how protected this database is. With the repository of keys to the backdoors in an unknown number of unknown hands encryption becomes useless as any crook and any unethical person has access, and the ethical and good people are being betrayed. That's akin to putting every criminal on the streets and every law-abiding person in prison. Is that what the US government and the FBI want?

    To me, it seems like the US needs another amendment to the Constitution, explicitly declaring the use of encryption legal. I am sick and tired of explaining secure communication to people in power without any understanding of technology and its implications. Again, even secure communication will have to have its metadata in plain text visible to any observer, and as such metadata is enough to catch and convict criminals. Insight into the ciphertext is not needed. After all, we do not outlaw the use of letters and only allow the use of postcards.

    Apple is looking into RISC-V

    Posted on by: Axel Kloth

    Apple is looking for RISC-V designers according to Tom’s Hardware. It certainly took Apple a while, and I had predicted that it would happen here. It just amazes me that it took so long. After all, Apple was one of the first investors in ARM when they decided that the Apple Newton was a good idea, and put an ARM 710 into the PDA. I had one of those, and while the idea was great, it simply needed vastly more computational performance than the ARM 710 could deliver.

    I am not sure if Apple remained a shareholder in ARM, or if SoftBank bought out everyone, but in either case it is time to get off the ARM train. I have had more than enough of what used to be Acorn RISC Machines.

    I think that RISC-V is a much more modern and advanced RISC processor, and its ISA is open source. Its ecosystem has grown at a phenomenal clip, and it has the potential to displace ARM. I would like to see Apple join the RISC-V train and help everyone build out the ecosystem for firmware, software and tools such as non-GPL compilers – LLVM/CLANG comes to mind.

    Hyperconverged Servers 2

    Posted on by: Axel Kloth

    As usual in IT, the pendulum swings in one direction to its extreme, and then back to its other extreme. We have seen disaggregated servers (compute nodes on one side, storage on the other), and then came hyperconverged servers. Then it swung back to disaggregated systems, and now we are back with hyperconverged systems. It seems that the transitions are arbitrary, but in reality, they are not. The Next Platform asks "If Hyperconverged Storage Is So Good, Why Is It Not Pervasive And Profitable?". I think two reasons explain the question and the frustration. First, the author appears to focus on just one company. Second, there are good reasons for both disaggregated systems and hyperconverged systems.

    Let me first explain why a disaggregated solution might be beneficial. If compute nodes have to be physically close to some experiment (like at CERN), then space constraints may mean that disaggregated systems are the only choice. If large amounts of data are created and processed, and that data crunching is computationally very intensive, and then that input and processed data must be stored or archived, then again a disaggregated system might make sense.

    If on the other hand the bandwidth needed from compute to storage and back is so high that networking is not fast enough, then hyperconverged systems are the only way out of that predicament. Hyperconverged systems also have the disadvantage that distributed storage (possibly even across continents) for disaster resilience is harder to implement. A little more background can be found here.

    Another issue is that the author focuses on one company, and not on the segment. Maybe for the segment the revenue and profitability data is better. I have no insight into it, but I'd verify from other independent sources if in fact the entire segment is doing so badly.

    VCs don't seem to fund cybersecurity companies

    Posted on by: Axel Kloth

    We see breach after breach after breach. Current hardware, firmware and operating systems as well as application software are not able to stem the tide. I am not sure what to make of it, as usually when there is demand, and a new product or service is available, customers come. A new product and lots of potential customers are generally what VCs salivate over. For reasons that are beyond my comprehension, that's not the case for cybersecurity hardware, firmware and operating systems as well as application software according to VentureBeat. I am at a loss. What gives?

    The market is there, as there is plenty of demand for cybersecurity. If the existing solutions worked, we would not see a continued problem in cybersecurity. While some of the breaches are due to incompetence and social engineering, the vast majority of breaches are due to exploits of weaknesses in all of the areas mentioned above.

    IBM's Telum introduces a novel Cache architecture

    Posted on by: Axel Kloth

    It looks like we are not quite alone in pointing out that the current architectures for caches are fundamentally broken. IBM has unveiled its newest mainframe processor, and it does away with L3 and L4 caches. IBM's Telum is described here in detail, and the one thing that surprises is its new cache architecture. AnandTech expands on this a bit and explains why IBM may have chosen this path.

    The most important takeaway is the same we have claimed for a while: caches are a band aid to mask the latency differences between a CPU core and memory. Caches do not contribute to any kind of computation. They simply hide the latency of main memory.

    We are glad to see that we have confirmation for our thesis.

    DRAM versus NVM

    Posted on by: Axel Kloth

    We have yet another point of reference that DRAM is too expensive even for Facebook. It seems as if Facebook is using (or evaluating the use of) non-volatile memory for cachelib instead of DRAM.

    Our HRAM delivers more than DRAM performance at densities of Flash, and at a cost comparable to DRAM. I had alluded to this before; it was clear from our simulations and matched the results from Arvind at MIT here.

    HPC is broken

    Posted on by: Axel Kloth

    The Next Platform stated in all capital letters that, according to Dimitri Kusnezov, a Department of Energy (DoE) expert on AI speaking at Hot Chips 2021, a new HPC architecture is needed: DOE AI Expert Says New HPC Architecture Is Needed. The DoE is responsible for all HPC efforts of the US government, so hearing from him that HPC is broken reaffirms our position. We have said that for quite a while, and so far, most technical experts agreed with us, but the sentiment of those holding the purse strings was mostly that "it seems to be working fine". It was not, and it is not. Something fundamental has to change. With Kusnezov and the DoE agreeing that it is broken, we believe that money will be made available to finally fix HPC for good. We are ready when they are.

    Here is an excerpt from his talk at HotChips 2021, as recorded by The Next Platform: But the highly complex simulations that will need to be run in the future and the amount and kind of data that will need to be processed, storage and analyzed to address the key issues in the years ahead — from climate change and cybersecurity to nuclear security and infrastructure — will stress current infrastructures, Kusnezov said during his keynote address at this week’s virtual Hot Chips conference. What’s needed is a new paradigm that can lead to infrastructures and components that can run these simulations, which in turn will inform the decisions that are made. “As we’re moving into this data-rich world, this approach is getting very dated and problematic,” he said. “Once you once you make simulations, it’s a different thing to make a decision and making decisions is very non-trivial. … We created these architectures and those who have been involved with some of these procurements know there will be demands for a factor of 40 speed-up in this code or ten in this code. We’ll have a list of benchmarks, but they’re really based historically on how we have viewed the world and they’re not consonant with the size of data that is emerging today. The architectures are not quite suited to the kinds of things we’re going to face.”.

    In other words, HPC was broken before AI made the problem so big that it cannot be ignored any more. To address the computational challenges from climate change and cybersecurity to nuclear security and infrastructure, among others, we will need a new HPC architecture that can deal with extremely large data sets and a speed-up of computation, I/O and storage.

    We are working with our partners to advance HPC towards a new paradigm. We even bridge the legacy world to the novel system architecture, and we add resilience and security features to HPC without giving anything else up.

    RISC-V HW Support for Virtualization

    Posted on by: Axel Kloth

    RISC-V is one of the most important novel processor Instruction Set Architectures (ISAs) of the last decade. It is well thought through, allows for very small implementations of a processor based on the ISA, and due to the fact that the ISA is open source, anyone can implement his or her own processor. What so far has been missing is hardware support for virtualization. We have implemented our version of HW virtualization support as we did not want to wait for the RISC-V steering committee to come up with its recommendation.

    We have developed a method that is universal, future-proof, provides ample performance for virtualization and a hypervisor, and has a software and hardware interface to an IOMMU. We are happy to share this and license this technology on a FRAND (Fair, Reasonable And Non-Discriminatory) basis with any other RISC-V processor company.

    HPC TAM

    Posted on by: Axel Kloth

    The Next Platform is reporting on the HPC TAM and its forecast for the next few years. It alludes to the fact that all assumptions point to The Rapidly Expanding And Swiftly Rising HPC Market because the number of HPC applications grows, and supercomputers are projected to become more affordable.

    We have been saying for a while that the market is already >$5B annually for just the semiconductor components in supercomputers today, and we project a very high growth rate of that TAM. This article confirms this and in fact assumes a TAM growth rate that is higher than our estimate.

    The Future of APIs at CSPA

    Posted on by: Axel Kloth

    I am extremely honored to have been invited to give a talk about The Future of APIs for Accelerators in Open Source at the California Software Professional Association (CSPA).

    If you are even remotely interested in APIs, accelerators, Open Source or the CSPA, please sign up for this event.

    Big and slow is faster than small and fast

    Posted on by: Axel Kloth

    I had searched for this article and finally found it (again). It may sound counter-intuitive, but big and slow is in fact faster than small and fast.

    Why is that the case? The reason is the enormous discrepancy between the throughput of a processor compared to DRAM, and DRAM bandwidth and latency compared to Flash memory. Processors today crunch through data incredibly quickly. In fact, they are so fast that SRAM caches are needed to hide the DRAM latency. If the processor cannot find the data in its cache, it will try to retrieve it from DRAM. It will take a penalty of many cycles to do so. If the data is not in DRAM, then it had been swapped out to disk in the past, and now the processor has to wait even longer as accessing a disk is painfully slow. Therefore, it makes sense to avoid having to access disks altogether. That is not quite possible yet as disks are dense and cheap, but reducing the frequency at which a processor has to access disks enhances performance. Therefore, exchanging expensive DRAM with cheap and slow (but still faster than disk) and much larger Flash memory makes sense.
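
    Here is a toy average-latency model that makes the trade-off concrete. The latencies and hit rates are purely illustrative assumptions (DRAM around 0.1 µs, Flash around 100 µs, disk around 10 ms per access), not measurements:

        #include <stdio.h>

        int main(void)
        {
            double dram_us  = 0.1;      /* assumed DRAM access latency  */
            double flash_us = 100.0;    /* assumed Flash access latency */
            double disk_us  = 10000.0;  /* assumed disk access latency  */

            /* Cluster A: a small DRAM tier that captures 90% of accesses,
             * the remaining 10% spill to disk. */
            double small_dram = 0.90 * dram_us + 0.10 * disk_us;  /* ~1000 us */

            /* Cluster B: a Flash tier large enough to hold the whole data set. */
            double big_flash  = 1.00 * flash_us;                  /*  ~100 us */

            printf("small DRAM + disk spill: %.1f us per access\n", small_dram);
            printf("large Flash, no spill:   %.1f us per access\n", big_flash);
            return 0;
        }

    Even though Flash is a thousand times slower than DRAM in this model, the configuration that never has to touch a disk comes out ahead.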

    For Big Data, it is not surprising that big memory beats small memory, even if the smaller memory is faster. This is exactly what Arvind Mithal at MIT has proven. Arvind found that size matters more than bandwidth and latency in Big Data applications, and that is why the Flash-based cluster was as fast as, if not faster than, DRAM-based clusters of servers. On top of that, the Flash-based cluster was cheaper. The reason is fairly simple: more memory means that the processors have to go to even slower disks a lot less often than they would with faster but much smaller DRAM.

    This mirrors our research data and convinces us even more that our HRAM is the right direction to go as that combines the benefits of DRAM and Flash.

    LinkedIn Connection Requests

    Posted on by: Axel Kloth

    I am really getting sick of the behavior I see with a number of LinkedIn members.

    Two particular groups rile me the most:

    Group 1 includes students who are too lazy to check the career page on our web site to find out how to apply for an internship; instead, they send a connection request.

    In our experience, those interns rarely turn out to be interested in learning. It is unfortunate, but it is a predictor of their performance, and therefore we do not bring those candidates on board any more.

    Group 2 consists of sales people who request to connect to offer "how to explore synergies between our companies" when in reality they want to sell something.

    The sales calls are more annoying. Let's say I find company A on LinkedIn, and they have a VP Business Development. My company B offers a service or product that is complementary to company A's product or service. In other words, there is plenty of overlap in the TAM (market and customers), and our products and services complement each other. In that case we not only do not cannibalize each other's service or product, we offer a more complete solution so that a customer simply gets a better result. Our common customer benefits, together we have more customers with better margins per customer, and both companies make more money - individually and as a synergistic group. That is synergy, and if I cannot find the VP BizDev's email on their web site, I request a LinkedIn connection. I can usually make the case, and in most cases I end up with a good connection.

    What is not synergistic is a sales call disguised as synergy. If my company's money ends up in your coffers because you sold me something then that is a sales call. There is no synergy.

    Consequently, if you want to sell me something, call it that, and tell me exactly what the benefits are for me. I have nothing against a sales call if you make a good argument. If you did not look up what I do and I receive a generic sales pitch, rest assured I will immediately remove you from my contacts. Same for sales pitches disguised as synergistic deals: I will remove you.

    Here you have it. You cannot say you have not been warned.

    Supreme Court Ruling on Oracle vs Google

    Posted on by: Axel Kloth

    CNN Business says that Supreme Court hands Google a victory in a multibillion-dollar case against Oracle.

    I'd rephrase it. I'd say that the SCOTUS has made the use of APIs reasonable. It is a somewhat difficult topic, so I will try to explain it in simpler terms.

    Let's say I am a developer of software or hardware. Let us assume that, hypothetically, I have found a new way to execute the square root function, both in hardware and in software, better than anyone else. I also want people to use my new square root function. So I publish an application programming interface (API) for it by defining y := sqrt(x), and I define what the argument values x and y are, and I define the types of representation (integer, floating point, UNUM, POSIT, and their respective lengths). In other words, I publish the API y := sqrt(x) for everyone to use so they do not have to invent yet another square root algorithm.

    The inner workings of my hardware or software that are called by the API are not visible to anyone without de-compiling or disassembling them. In general terms, they are binaries or executable CPU-specific machine code, or they are calls into a specific piece of hardware that I may have developed. In my function, I first check if the hardware is present to execute the sqrt function. If so, I hand over the input data x and wait for the hardware unit to be ready, and when it hands me the result back, I put this into y so that the calling software can use y. If I do not detect the hardware needed, I execute sqrt in software, and then hand over y as before.
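
    A minimal sketch of that pattern in C, assuming a hypothetical hardware hook: hw_sqrt_present() and hw_sqrt() are placeholder names I made up, stubbed out here so the example compiles; they stand in for a driver talking to a square-root unit.

        #include <math.h>
        #include <stdbool.h>
        #include <stdio.h>

        /* Hypothetical hardware hooks; stubbed out so the sketch compiles. */
        static bool   hw_sqrt_present(void) { return false; }
        static double hw_sqrt(double x)     { return sqrt(x); }

        /* The published API: the only thing callers ever see. */
        double my_sqrt(double x)
        {
            if (hw_sqrt_present())
                return hw_sqrt(x);   /* hand x to the hardware unit and return y */
            return sqrt(x);          /* software fallback, here via libm */
        }

        int main(void)
        {
            printf("my_sqrt(2.0) = %.15f\n", my_sqrt(2.0));
            return 0;
        }

    The caller only ever sees my_sqrt(); how the result is produced behind that API stays invisible, which is the point of publishing an API in the first place.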

    I can hide the function for sqrt and never have to expose how I do it. If someone comes up with a better implementation of the function, then they will likely at some point displace my function - whether they call it sqrt or not. In no case should I be able to sue anyone just because they call their function sqrt, unless I trademarked the name to avoid diluting my brand. If I do not trademark it - and the function call name sqrt should not be patentable - then anyone can use sqrt, create another, better sqrt, or create something similar that encompasses mine.

    That is very different from stealing how I execute my sqrt operation inside my hardware or software. If someone does that, then I should have the right to sue them for IP theft.

    If someone merely re-implements an existing solution with the same APIs or simply uses the APIs via function calls from an application software, then that is the intended purpose of an API, and it should not be subject to a legal battle.

    That was what the Google versus Oracle suit was all about, and rightfully Google prevailed.

    Robustness and EDC/ECC in Memory

    Posted on by: Axel Kloth

    When we look at Supercomputers, we look at the culmination of thousands of cores, Terabytes to Petabytes of main memory, and Petabytes to Exabytes of mass storage. If we want the results to be mathematically correct, we first of all have to use a number system that allows mathematical precision to the degree needed. Recent research by Professor John L. Gustafson has made it abundantly clear that Floating Point math does not quite work as well as we had assumed. UNUM and POSITs are much better representations for numbers in registers that are inherently limited in length. You can read up on UNUMs here: The End of Error: Unum Computing - CRC Press Book.
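
    Two tiny, self-contained examples of the rounding behavior of standard IEEE-754 doubles illustrate the point:

        #include <stdio.h>

        int main(void)
        {
            /* 0.1, 0.2 and 0.3 are not exactly representable in binary. */
            double a = 0.1 + 0.2;
            printf("0.1 + 0.2 == 0.3 ? %s (difference: %.2e)\n",
                   (a == 0.3) ? "yes" : "no", a - 0.3);

            /* Absorption: adding 1.0 to 1e16 is lost entirely at this magnitude. */
            double b = (1e16 + 1.0) - 1e16;
            printf("(1e16 + 1.0) - 1e16 = %.1f\n", b);

            return 0;
        }

    At supercomputer scale, billions of such rounding and absorption errors accumulate, which is what number formats such as UNUMs and POSITs set out to address.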

    Google has found spurious errors in CPUs executing arbitrary functions (FYI: Today's computer chips are so advanced, they are more 'mercurial' than precise - and here's the proof), indicating either production, clock frequency or inherent design issues at the modern process technology nodes used in production. Memory has grown to sizes where even an error rate of 1 in 10^15 is not good enough any more. As a result, we must make sure that we account for these problems properly. CPUs must be designed with more focus on correctness and robustness. Memory must be built such that autonomous verification of the correctness of stored information is guaranteed. Verification of the stored contents must be done without CPU intervention, and detection of multi-bit errors and correction of single- and double-bit errors must happen inside the memory subsystem itself, not only when data is written to or fetched from main memory. In other words, main memory has to become smart. It has to be able to scrub itself periodically, it needs to detect spurious and persistent errors and flipped bits, and it has to be able to correct spurious single- and dual-bit errors autonomously. It should be able to take persistently defective cells or pages out of operation without CPU intervention, and it must remap spare memory into the affected address space. The host processor should be informed of this fact, but its memory space must not be affected in any way. In essence, the remapping should be done such that it is invisible to the operating system memory management tables and the processor's address space, but visible to the processor's and operating system's OAM&P software.
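
    As a sketch of what such a smart memory subsystem could do, here is a toy scrub loop in C. secded_check_and_correct(), remap_page() and notify_host() are hypothetical placeholders for a SECDED ECC engine, spare-row remapping logic and a host notification path; they are stubbed out so the sketch compiles, and the loop would run inside the memory controller, not on the host CPU.

        #include <stdio.h>

        typedef enum { ECC_OK, ECC_CORRECTED, ECC_UNCORRECTABLE } ecc_result_t;

        /* Hypothetical hooks, stubbed out so the sketch compiles. */
        static ecc_result_t secded_check_and_correct(unsigned long addr)
        {
            (void)addr;
            return ECC_OK;                    /* pretend the line is clean */
        }
        static void remap_page(unsigned long addr)
        {
            printf("remap page at 0x%lx\n", addr);
        }
        static void notify_host(unsigned long addr, ecc_result_t r)
        {
            printf("notify host: 0x%lx, result %d\n", addr, (int)r);
        }

        /* One scrub pass over a physical address range, line by line. */
        static void scrub_pass(unsigned long base, unsigned long size, unsigned long line)
        {
            for (unsigned long addr = base; addr < base + size; addr += line) {
                ecc_result_t r = secded_check_and_correct(addr);
                if (r == ECC_CORRECTED) {
                    notify_host(addr, r);     /* fixed in place; report only  */
                } else if (r == ECC_UNCORRECTABLE) {
                    remap_page(addr);         /* map in a spare, keep the host */
                    notify_host(addr, r);     /* address space unchanged       */
                }
            }
        }

        int main(void)
        {
            scrub_pass(0x0, 4096 * 16, 64);   /* scrub 16 pages in 64-byte lines */
            printf("scrub pass complete\n");
            return 0;
        }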

    DRAM itself has design vulnerabilities that can be exploited. These are predominantly what is called Rowhammer and Half-Double. TechExplore published that Google announces Half-Double, a new technique used in the Rowhammer DRAM security exploit. Google itself published this exploit here. The story was picked up by Gaming on Linux, by ZDNet, by Wired and by TechRadar.

    Unfortunately, it does not end there. Despite the introduction of "Secure Boot", the BIOS remains vulnerable. Since the BIOS content is non-volatile, any attack against the BIOS that achieves a successful overwrite will install a persistent threat. As an example, ThreatPost claims that 30M Dell Devices at Risk for Remote BIOS Attacks.

    It is therefore imperative that we design a better memory subsystem that offers better performance, supports more linear scale-out in performance, provides better robustness against spurious errors, can detect multi-bit errors autonomously, can remap memory areas without CPU intervention, shields against RowHammer and Half-Double attacks, and incorporates defenses against BIOS attacks.

    Supercomputer "Speeds"

    Posted on by: Axel Kloth

    I always hear people talk about "Supercomputer speeds". Usain Bolt is fast for a human. Cheetahs are fast. Peregrine falcons are fast - at about 320 km/h or 200 MPH they are incredibly fast. But that is not the type of "fast" that we are looking at when judging computers. Computers are rated based on the amount of computational problems they can solve in a given period of time. Certain "benchmarks" have evolved over time to make such determinations. Those benchmarks include but are not limited to LINPACK, BLAS (Basic Linear Algebra Subprograms) and DGEMM/SGEMM (double and single precision Matrix Multiplication).

    BLAS and DGEMM as well as other benchmarks define a certain size of the matrix (or matrices) for the multiplication to be carried out. Since BLAS and DGEMM make extensive use of what in computer science is called a fused multiply-add, or FMA for short, this is the instruction that CPU designers optimize most. For any given matrix size, the multiplication requires row elements to be multiplied by column elements. These are usually neatly arranged in memory and are loaded by the Cache Controller in blocks called Cache Lines, to circumvent the high latency of DRAM (Dynamic Random Access Memory, the main memory). Therefore, knowing the maximum size of the matrix, optimizing the caching algorithm for that matrix size, and having an efficient FMA can be used to create benchmark results that oftentimes cannot be replicated in real-life applications.

    A big problem with that approach is that while caching itself is not bad, it leads to the design and deployment of caches that are larger than they would need to be in a better-balanced system. Since Caches consist of very fast transistors (particularly the TCAM or tag RAM in them), they consume a very large portion of the energy that the processor uses. However, Caches only mask latency differences between the internal registers of processors and the main memory they use, which is DRAM. Caches do not compute, and they do not create any computational results. Partially because of BLAS and DGEMM, the industry has increased Cache sizes more and more, and focused less on improving interconnects and bisection bandwidth. That has left us in a situation in which we have very large Caches in the processors and accelerators, and we bank on the Cache Controllers having pre-fetched the proper data instead of making sure that the interconnect bandwidth between processors is high enough to enable any-to-any core access without having to go through remote memory and thus relying on the efficiency of the Cache Controllers and their algorithms and policies for caching and aging.
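
    For reference, here is an intentionally naive sketch of the DGEMM kernel these benchmarks revolve around (C = C + A * B for square n x n matrices). The innermost statement is the fused multiply-add; production BLAS libraries block these loops for the exact cache sizes of the target CPU, which is precisely the tuning that produces benchmark numbers real applications rarely see.

        #include <stdio.h>
        #include <stdlib.h>

        static void dgemm_naive(size_t n, const double *A, const double *B, double *C)
        {
            /* i-k-j loop order keeps the B and C accesses sequential in memory. */
            for (size_t i = 0; i < n; i++)
                for (size_t k = 0; k < n; k++) {
                    double a = A[i * n + k];
                    for (size_t j = 0; j < n; j++)
                        C[i * n + j] += a * B[k * n + j];   /* the fused multiply-add */
                }
        }

        int main(void)
        {
            size_t n = 256;
            double *A = calloc(n * n, sizeof *A);
            double *B = calloc(n * n, sizeof *B);
            double *C = calloc(n * n, sizeof *C);
            if (!A || !B || !C) return 1;

            for (size_t i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }
            dgemm_naive(n, A, B, C);
            printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0 * (double)n);

            free(A); free(B); free(C);
            return 0;
        }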

    For n-body, FEM and FEA and all other computational problems with nearest-neighbor interaction, the only metrics that count are bisection bandwidth and latency. That cannot be resolved with caching. A novel architecture is required. That's the reason why we do things differently.

    Just to recap, here is how outsourcing some of the computational work from a CPU to an accelerator or a coprocessor works (a code sketch follows the list):

    • A piece of software that was written to distribute tasks between CPU and accelerator identifies a task that benefits from execution on an accelerator
    • As soon as all input data and accelerator instructions are available, the CPU (or a DMA Controller or IOMMU) transfers the data to the accelerator or its memory
    • The CPU instructs the accelerator what to do (basic or compound math functions) in the accelerator's instruction set
    • The accelerator crunches through data
    • While the accelerator finishes its task, the CPU core(s) can continue executing other work
    • Most often during those times, the host CPU retrieves and prepares new data
    • When the accelerator is done crunching, it issues an interrupt request to the host CPU
    • The host CPU responds to the IRQ and retrieves data from the accelerator via software or through a DMA Controller or IOMMU
    • This sequence continues until all data is processed
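
    Here is a host-side sketch of that sequence in C. The acc_* functions are hypothetical placeholders for a real accelerator driver API (DMA in, start, wait for the completion interrupt, DMA out); they are stubbed out with a toy "double every element" kernel so the sketch compiles and runs on its own.

        #include <stdio.h>
        #include <string.h>

        #define CHUNK 1024

        static double device_buf[CHUNK];   /* stand-in for accelerator memory */

        static void acc_dma_to_device(const double *src, size_t n)
        { memcpy(device_buf, src, n * sizeof *src); }   /* stub for a DMA transfer in */

        static void acc_start(void)
        { for (size_t i = 0; i < CHUNK; i++) device_buf[i] *= 2.0; } /* stub: stands in for the device crunching the data */

        static void acc_wait_irq(void) { /* stub: block until the completion IRQ */ }

        static void acc_dma_from_device(double *dst, size_t n)
        { memcpy(dst, device_buf, n * sizeof *dst); }   /* stub for a DMA transfer out */

        int main(void)
        {
            double in[CHUNK], out[CHUNK];
            for (size_t i = 0; i < CHUNK; i++) in[i] = (double)i;

            acc_dma_to_device(in, CHUNK);     /* move input data to the accelerator   */
            acc_start();                      /* tell the accelerator what to do      */
            /* the host could prepare the next chunk of data here                     */
            acc_wait_irq();                   /* accelerator signals completion       */
            acc_dma_from_device(out, CHUNK);  /* retrieve the results                 */

            printf("out[10] = %.1f\n", out[10]);   /* 20.0 for this toy kernel */
            return 0;
        }

    This administration (DMA setup, kick-off, interrupt handling) is the overhead that determines how coarse the offloaded tasks have to be.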

    To accelerate the execution of instructions, nearly all accelerators are created from combinatorial logic, i.e., they do not use a multi-stage pipeline like processors do. An accelerator built from combinatorial logic is effectively a fixed-function device: it will use a lot less energy to execute a particular task than a programmable processor, and it will usually do so in less time, but it is of course less flexible. It will execute the function it was designed for, but it is not programmable. In other words, if such an accelerator is designed to execute function xyz, and xyz is deprecated, then it cannot be used to execute function abc instead.

    That is one of the reasons why most accelerators focus on optimizing the execution of well-settled functions that do not change. Examples are matrix math, tensor math, finite element analysis, any kind of transforms such as Fourier Transforms. There are a few more computational problems that have not changed in a long time, and most cryptographic functions fall into that category. However, with the advent of quantum computing it is unclear if AES, SHA-2 and SHA-3 will have a future, and as such, they fall into a category that is not actively pursued much outside of cryptocurrency mining and blockchain verification. Cryptanalysis will continue to be of interest to a lot of organizations.

    Looking at the above list also makes one other implication very clear. There is a considerable amount of administration required to farm out a computational task. As a result, only tasks that are fairly lengthy to execute on a general-purpose CPU should be offloaded to an accelerator. If the administration of farming out takes 100 cycles, then the task that is farmed out should save at least 1,000 cycles, or its benefits are drastically diminished. As a result, we see mostly fairly coarse granularity of farmed-out tasks.

    How an Interview can go wrong

    Posted on by: Axel Kloth

    I have been interviewed many times in my career. The vast majority of journalists get it right, send you a draft upfront to review and edit if needed, and then publish the article with the quotes. However, sometimes it goes terribly wrong, so be careful who interviews you, and who quotes you. I have included one example that still irks me today, as the alleged quote is not only factually incorrect, but more importantly, I have never been interviewed by the person who claims this quote.

    Here is what I told a journalist who interviewed me. This is the reviewed and agreed-upon text:

    If you receive a postcard, ask yourself what part of the data on the postcard is correct and trustworthy. Usually, on a postcard you'll find your (the recipient's) address, some text or advertisement from someone, and possibly a sender's address. It might even have a stamp. Out of all this data, the only information you can trust to be reasonably correct is your address. If it were not, you would not have received the postcard. The text on the postcard can be fully made up and thus cannot be trusted. The same is true for the sender's address and name. Neither one of those pieces of data has to be present or correct for the postcard to arrive at your address. As such, you can't trust much of the data on a postcard. The situation is not much better with a letter. If inside the letter there is a verbatim copy (or even better, an encrypted version of the sender and receiver data based on a pre-shared password) of the recipient's and the sender's address, it gives you a little more confidence that these pieces of data are correct, but not much. After all, envelopes can easily be opened and re-sealed. The equivalent assumption of confidentiality, correctness and authenticity can be made for any kind of unencrypted communication over the Internet.

    As far as I know, the agreed-upon article was never published. Somehow, the interview text made it to someone working for an Indian newspaper, and that person not only butchered the text but got it entirely wrong. This is what was published by a person who never talked to me, including a quote that I never gave:

    "Say you receive an envelope and on that envelope is your name and address, a return name and address and a postmark," he explained. "You can authenticate the recipient with surety. If you know the person on the return address you might know who isn’t it, and the postmark gives you some idea that the government has properly delivered it. But you do not know if the envelope contains a letter or anthrax powder. If you cannot authenticate each part of the delivery mechanism you don’t have security."

    As you can see, the alleged quote has nothing to do with what I said. It is factually incorrect, it adds items that I have never said, and a style analysis reveals that in fact this is not in line with any of my other statements or posts. The problem is of course that this entirely made-up "quote" is out there, and there is nothing I can do to make the journalist retract it.

    HPC, System Uptime and Security

    Posted on by: Axel Kloth

    There is one recurring question I get, and that is “Why if you are an HPC company do you care about or post on security and system availability/uptime issues?”

    Let me explain why. First of all, we are not a security processor company. Second, we believe that security and hardening measures must be included in all new hardware, firmware, operating systems and application software. Third, we see that with 5G and Fiber To The Home (FTTH) and Fiber To The Curb (FTTC) the attack surface for the Internet backend has grown and will continue to grow, and the available bandwidth for attacks to take out large portions of the Internet infrastructure has increased to a degree that the so-called “edge” will have to take a role in protecting the backend.

    The magical “edge” is the collection of curbside compute that allows for 5G and FTTH and FTTC to be viable and useful. Edge computing reduces latency to the user as some of the preprocessing is executed there, and low latency is crucial for a wide variety of applications – gaming and Advanced Driver Assistance Systems (ADAS) as well as Vehicle-to-anything (V2X) included. While most compute will still occur in the backend, the edge will become more powerful, but unlike the backend, the edge is not physically protected. As a result, any attacker will have relatively easy access to the compute power at the edge. Because of the compute performance and the bandwidth of edge devices, an attacker can wreak havoc on the Internet backend, including supercomputers. Therefore, we believe that the edge must be able to protect itself and the backend from attackers as much as possible. A botnet of edge devices will bring down the Internet.

    We have seen the tactics of attackers change. We went from Denial-of-Service (DoS) and distributed Denial-of-Service (DDoS) attacks to attacks against the DNS system, and while the Internet did not go down, many users experienced it as such. While we have our own DNS servers, and we were able to continue to work without interruption, many other users were not. With more bandwidth and compute power at the edge, an attacker has a much higher chance to disrupt Internet access and Internet traffic.

    That is why we focus on system availability, including the security measures we need to take to protect systems built with our processors. If a system is under attack, if its incoming links are unavailable, or if it has been forced to reboot or is in a rolling recovery situation, it is not available for its intended tasks. All of this counts towards system downtime, and HPC time is expensive and precious. As a result, we try to make sure that systems built with our processors have the hardware and firmware robustness needed to withstand an onslaught of attacks from compromised edge devices, and ideally, we'd work with those edge devices to avert a situation in which a large number of edge devices is compromised in the first place.

    Because of its enormous compute power and bandwidth, we need to make sure that a supercomputer cannot be compromised. Any computer is most vulnerable to attacks during boot time, when most hardware and software security mechanisms are not yet active. We are working on solutions to plug those holes. We are also working on a more active firewall. For example, if someone pings us, we do not respond. If from the same IP address we then see a netscan, that event is logged, but today manual intervention is needed to block that IP address. That does not make sense - the firewall should block that IP address after one ping (or traceroute) and one netscan attempt all by itself.
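
    As an illustration of the ping-then-scan policy described above, here is a minimal sketch of the decision logic. It is deliberately simplified - a real firewall would track state per source address in a hash table with aging, and the event types here are assumptions on my part.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified per-source-address state for the "one probe + one scan -> block" rule. */
typedef enum { EV_PING, EV_TRACEROUTE, EV_NETSCAN } event_t;

typedef struct {
    uint32_t src_ip;        /* IPv4 source address (host byte order)      */
    bool     probed;        /* saw a ping or traceroute from this address */
    bool     blocked;       /* drop all further traffic from this address */
} src_state_t;

/* Returns true if the source should now be blocked automatically,
 * without waiting for manual intervention. */
bool on_event(src_state_t *s, event_t ev)
{
    if (s->blocked)
        return true;
    if (ev == EV_PING || ev == EV_TRACEROUTE) {
        s->probed = true;               /* log the probe, do not respond   */
    } else if (ev == EV_NETSCAN && s->probed) {
        s->blocked = true;              /* probe followed by a scan: block */
    }
    return s->blocked;
}
```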

    Aside from those reasons, I am as annoyed as anyone else if the Internet is not available or upload or download speeds are abysmal – even at home.

    HPC as a Service

    Posted on by: Axel Kloth

    It was only a matter of time until someone would announce and implement HPC as a Service (in the Cloud, of course).

    Fundamentally, a supercomputer today is the same as any data center operated by the hyperscalers and all of the Cloud providers. The infrastructure in both cases consists of Commercial Off-The Shelf (COTS) x86-64 servers, connected via COTS top-of-the-rack switches and a COTS global switch connecting the top-of-the-rack switches, plus an equivalent storage subsystem, and some of the general-purpose CPUs may be accelerated via General-Purpose Graphics Processing Units (GPGPUs).

    The only difference is in the interconnect, which in the case of the hyperscalers usually is 10 Gbit/s Ethernet, whereas the supercomputer faction typically uses lower-latency and higher-bandwidth InfiniBand with DDR or QDR data rates of 20 or 40 Gbit/s. As a consequence, it does not take much to change a portion of a hyperscaler's data center over to support both Ethernet and InfiniBand, and sell the CPU hours at a much higher price to HPC users compared to normal users.

    Here at ZDNet is one of the stories about HPC as a Service.

    Intel, GlobalFoundries and SiFive

    Posted on by: Axel Kloth

    Intel has been in the news a lot lately. That is largely due to Pat Gelsinger rejoining Intel. I have the utmost respect for Pat, and I know that he is up to quite a challenge.

    I don't know how to say this any friendlier, but at this point in time Intel is a half-trick pony. Intel has x86-64, which is slowly but surely running out of steam, and a few failed expensive acquisitions that don't play well together. MobilEye and Altera really don't complement each other, and I do not see how Nervana and Habana could augment each other's offerings. Also, Intel just discontinued Itanium, and while that was a good and overdue step, it reduces the options in processor ISAs that Intel has available to itself.

    Then there is the fact that in all large organizations complacency sets in. In the beginning, every startup will attract people who want to do something exciting and new, and they work their butts off. Once a company is established and generates profits, it will attract a different kind of applicant. They are usually seat warmers, and their only purpose is to collect a paycheck. They are the status-quo people who object to any change. They justify their behavior to themselves and to others by stating that someone has to keep a straight course. They know or at least suspect that their behavior might sink the ship, but then do the math and quickly come to the conclusion that the 15 years they have left until retirement are not enough time for the mothership to go belly up.

    Pat will have to clean out the house. I assume that in due time we will see a RIF to get rid of people who do not contribute. He is in a much better position to identify those than the prior CEOs, who were deeply non-technical. He will need chip manufacturing capacity, and that is where the GloFo acquisition comes in handy. He needs a different processor architecture, as it is certain that x86-64 will run out of steam. That is where SiFive fits in, and I had alluded to that earlier here: Intel interested in SiFive. GloFo will also help Intel become a more customer-centric organization, and while GloFo itself is not the best at that, it will be a wakeup call.

    Intel will be able to shift production around: datacenter CPUs and accelerators on its newest nodes, and all supporting and peripheral chips in the older fabs from Intel and in the newly acquired GloFo fabs. Then slowly build out the ex-GloFo fabs, bring them into the FinFET and GAA age, and reap the benefits of using the tools that external customers use inside Intel as well. I suspect that at this point in time, Intel spends a fortune on tools developed in-house for the design and the DV (design verification) of its processors and peripherals. If Intel can switch over to commercially available tools from Synopsys, Cadence and Siemens Software (ex Mentor) as well as from AnSys, that would save tons of money, and it would allow Intel to attract design engineers from the outside and get them productive without retraining on its internal tools.

    Intel has discontinued Itanium

    Posted on by: Axel Kloth

    Over 20 years ago, Intel introduced a non-x86 processor, the Itanium. This was a collaboration between HP and Intel for a successor to the Intel x86 architecture. HP contributed what was HP PA (Precision Architecture), and Intel added EPIC (Explicit Parallel Instruction Computing). EPIC was a departure from traditional CPU design as it relegated a lot of the work of parallelizing instructions to the compiler instead of hardware within the CPU. Intel promised improved performance over the traditional design philosophies, and positioned Itanium above anything x86. Unfortunately, those promises never materialized. Itanium was never able to surpass x86 in performance in emulated or in native mode, and even native Itanium applications were never able to drastically outperform any competitor. Part of the problem was the complexity of the compiler, and Intel never managed to create a compiler that was fully able to extract the theoretical benefits of EPIC.

    While Intel had announced the discontinuation of Itanium years ago, its eventual demise came quietly, and it would have gone mostly unnoticed had publications such as TechSpot not announced it. Most comments were not exactly kind to Itanium, as we can see here: Intel's Itanium is finally dead. To a large degree, that criticism is well-deserved, but Intel certainly deserves some credit for introducing a new processor architecture in which parallelization was done in an unconventional way.

    While some put the blame on the lack of compatibility with legacy x86 (yes, the 32-bit version, not x86-64) software, I don't think that this is correct. Had Itanium shown better performance in native mode over any other processor or over the emulated x86 code, then I am fairly certain that Itanium would have succeeded despite using a different ISA. But it never did. I think that both the price point and the performance (or lack thereof) had to do with its demise. In the end, only HPE used Itanium - and for a good reason.

    Rest in Peace, Itanium, and hopefully with EPIC at your side. We will bury you next to SPARC.

    HPC applications with impact

    Posted on by: Axel Kloth

    HPC is not an easy topic to explain. I frequently encounter people asking me what I do. Here is a brilliant article outlining what HPC has been able to do for humankind lately, and that article focuses on only 6 applications that have saved lives and made our world a better place.

    GlobalFoundries going IPO

    Posted on by: Axel Kloth

    In the past few weeks, there were lots of news about GlobalFoundries. One rumor stated it was going to be acquired by Intel. The Wall Street Journal reported on it here: Intel Is in Talks to Buy GlobalFoundries for About $30 Billion. Forbes tried to find a good reason for this acquisition in Intel’s Possible Rationale For Buying GlobalFoundries, Inc., and even The Next Platform was not quite sure if the idea was that great. Another Crazy Idea: Intel Might Buy Globalfoundries is not exactly an endorsement.

    GlobalFoundries' CEO Tom Caulfield responded nearly instantaneously here: CEO of ex-AMD fab GlobalFoundries shoots down Intel buyout, and stated that the company was planning an initial public offering (IPO). My opinion is that, very simply, GloFo is not set up for an IPO. GlobalFoundries was created by spinning off the former AMD chip manufacturing unit, and over time it added more and more fabs to the portfolio, including the former IBM Microelectronics. They all used specialty processes, and they had to be integrated with each other. In the process of doing so, GlobalFoundries lost sight of the need to perpetually improve. It was left with the embarrassing acknowledgment that it would not pursue any process nodes beyond 10 nm. While that is a FinFET process, and it has licensed Silicon-on-Insulator (SoI) from ST Micro (and later on from Samsung), it is not a leading supplier for the manufacturing of processors and accelerators. In essence, it now has planar transistor processes, one or two SoI processes, and one or two different FinFET processes. I do not see how it can support its customers with Process Development Kits (PDKs) at the current level of revenue. Simply put, trailing process nodes command commodity pricing.

    Hyperconverged Servers

    Posted on by: Axel Kloth

    Everything old is new again. The first supercomputers were machines that were very different and distinct from any regular computer. They were different from minicomputers, mainframes and from PCs. At that time, internal interconnects were vastly faster (bandwidth and latency) than anything a network could offer, so compute and storage were in the same machines. A supercomputer would contain not only the compute subsystem, but also the storage subsystem. The same applied to minicomputers and PCs.

    Then it turned out that more bang for the buck could be had by not building dedicated, special-purpose supercomputers. Instead, industry-standard servers were used to create a supercomputer. They were just regular PC-based servers connected via the fastest interconnects that could be had. That worked out well for a while, but disks were very slow, and tapes even slower. Since data management was needed, and that alone was computationally moderately intensive, storage management was put as software onto industry-standard servers to take over all storage management tasks, and so the storage appliance was born. This allowed supercomputers to be logically divided into two different classes of computers: the compute clusters and the storage clusters. Compute clusters focused on compute (i.e., lots of CPUs and accelerators and memory), and the storage appliances in the storage clusters took over all storage tasks, including snapshots, deduplication, tape operations, backup and restore and of course disk caching. Doing so cut down on cost while improving performance.

    It worked well for as long as the network was faster than cached disk I/O. The advent of Flash in the form of SAS- or SATA-attached SSDs started to change this. PCIe-attached storage provided a level of performance that network-attached storage simply could not match any more. PCIe Gen3 in the 16-lane variant tops out at less than 16 GB/s, and that is the ceiling both for PCIe-attached Flash and for any network-attached storage that has to come in through a PCIe-attached network adapter. As a result, all high-performance storage was pulled back into the compute nodes, and only nearline and offline data storage on SATA disks and tape as a backup is now left on storage appliances. In essence, what used to be a performance-enhancing technology has now become simply a bulk storage and backup/restore technology, possibly with features for deduplication. This has caused a simple convergence of compute and performance-oriented storage, creating the hyperconverged server.
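
    The "less than 16 GB/s" figure for PCIe Gen3 x16 follows directly from the link parameters: 8 GT/s per lane with 128b/130b line encoding. A quick calculation (before protocol overhead such as TLP headers, which lowers the usable number further):

```c
#include <stdio.h>

int main(void)
{
    /* PCIe Gen3: 8 GT/s per lane, 128b/130b line encoding. */
    double gts_per_lane = 8.0;               /* gigatransfers per second  */
    double encoding     = 128.0 / 130.0;     /* payload bits per line bit */
    int    lanes        = 16;

    double gbits  = gts_per_lane * encoding * lanes;  /* usable Gbit/s, one direction */
    double gbytes = gbits / 8.0;                      /* ~15.75 GB/s per direction    */

    printf("PCIe Gen3 x16: %.2f Gbit/s = %.2f GB/s per direction (before TLP overhead)\n",
           gbits, gbytes);
    return 0;
}
```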

    High-Performance RISC-V Cores

    Posted on by: Axel Kloth

    Multiple media organizations including Heise in Germany have reported that China (more specifically the Institute of Computing Technology at the Chinese Academy of Sciences or ICT CAS) has built a high-performance RISC-V core. Even The Register reported on this, under the headline “Chinese chip designers hope to topple Arm's Cortex-A76 with XiangShan RISC-V design”.

    The processor is named XiangShan, and ICT CAS faculty have posted its entire Chisel/Scala source code on GitHub. There is also a fair amount of documentation, some of it in English, and a few schematics and block diagrams that go with the basic architecture. I have not had time to synthesize the processor and verify the performance (that will take me a while anyway), but from the text and the schematics the performance claims make sense. It is an out-of-order design. The pipeline depth of 11 stages seems to be confirmed by the documentation and the schematics, and so is its six-issue width. It seems to rely on 4 DDR DRAM Controllers, and I am still trying to find out if the DRAM Controllers are part of the design. In the original SiFive and UC Berkeley designs, the DRAM Controller is not included.

    Without knowing anything about the DDR DRAM Controllers, it is hard to estimate what the performance is going to be. After all, most modern CPUs outperform their memories, and therefore the caching strategies (Cache types and hierarchies and Cache Controllers) and the DRAM Controllers are an integral part of a processor's performance.

    If the DRAM Controllers are not part of the design, then commercially available DRAM Controllers from Synopsys or Rambus (ex NorthWest Logic) can be used. However, that takes away part of the open source design, as without the DRAM Controllers, the chip cannot be finished. I am also still looking for the PCIe Gen3 Controller with the 4-lane interface, and I have not yet found it in the source code. Having just finished the design of a PCIe Gen3 Endpoint Controller with a 4-lane interface, I can say that this is not a trivial undertaking. But again, if this is not part of the open source release and has to be sourced from a commercial entity, then most core components of this chip are open sourced, but not the whole design.

    Either way, if ICT CAS did in fact tape this chip out at TSMC on a 28 nm node, then a maximum clock frequency of 1.2 to 1.3 GHz is believable. If that was the target node and if ICT CAS used the Synopsys DDR4 DRAM Controllers, then I have a hard time believing that this processor achieves the ARM Cortex-A76 level of performance. Considering the more complex ISA of the ARM processor with more compound instructions and its higher clock frequency, I'd think that the XiangShan processor is probably at 60 - 70% of the performance level of the Cortex-A76. It nevertheless is an extraordinary achievement by the ICT CAS team. I do not want to take away anything from their impressive achievements, and considering the learning curve, I'd wager that it won't be long before they outperform the ARM family in per-core and per-processor performance metrics. However, like ARM, they have not solved the underlying problem of scalability.

    Looking up the meaning of Xiang Shan comes back as Fragrance Hill, a hill near Beijing. I am not sure if there is a deeper meaning behind this code name other than mimicking Intel’s use of lakes as code names for their processors.

    Right to Repair and Privacy

    Posted on by: Axel Kloth

    It looks like Steve Wozniak has commented on the Right to Repair. When Steve talks, I usually listen. I agree that he has a point, as the repair of a device is oftentimes economically and ecologically advantageous. It does not make sense to throw away a device that is repairable when a non-crucial component is damaged. The question then is what counts as a non-crucial component, and when a device should be deemed non-repairable because the repair has an impact on whether we trust the repaired device to keep secrets the same way that the original device did.

    After all, for many of us, the phone has become the holder of all of our secrets – bank accounts, passwords, social security number, cashless payment system info, birthdays, fingerprint or iris scans and so on. We inherently trust Apple and Samsung, Qualcomm and a few other companies that supply the mainstream phone manufacturers because they have an established supply chain that is thoroughly vetted and continually monitored for compliance with applicable laws and additional policies imposed by the phone manufacturer.

    Let me use an analogy to highlight where the issue lies. Let's say your Mercedes is involved in a fender-bender. Only the right front fender, which is bolted on and has no impact whatsoever on anything in a subsequent crash, is damaged. You would not throw away the car because the fender is bent. You'd repair the car, and you might choose to use an aftermarket fender that is cheaper. While it may rust through ten years down the road, possibly earlier than an OEM part would have, there is no impact on the safety of the car itself, its value, or its longevity. If a crucial part of the car is damaged, such as the components that make up the safety cell, the car is usually deemed non-repairable because in a subsequent accident, it might not protect you.

    The same applies to the phone. If a socket, a switch or the display gets damaged, repairs are and should always be possible and feasible. The same goes for the battery.

    Now if the CPU or the security processor needs to be replaced, I would deem that non-repairable by a 3rd party. Why? As stated above, Apple and Samsung and other established phone manufacturers have a supply chain that is well controlled. If you bring your phone with the dead CPU or security processor to shadytreephonerepairs.com, which sources the security processor from insecurityprocessors.com and the firmware for it – to resemble something like a working security processor – comes from backdoorfirmware.com, then I’d question if the prior level of trust in the phone to keep things private can be re-established.

    If trusted and trustworthy 3rd party repair companies exist that have a supply chain that is similar to the OEMs, then that is a different story, but history has shown that that is not always the case. The danger of knowingly or unknowingly using counterfeit parts always exists if the supply chain is not clear.

    Louis Rossmann, who runs the non-profit Repair Preservation Group Action Fund, has shown in many of his videos on YouTube that manufacturers' repairs (or attempts thereof) are not necessarily always successful or even state of the art, but at least they use the same components as the OEM, coming from the same supply chain.

    What is my take? I support the Right to Repair movement and think it makes sense. However, some provisions have to be made to either limit that right if crucial components are affected, or to help establish 3rd party repair companies with a supply chain and associated quality standards and a guarantee from that company to the user that OEM parts are used. Those limits must be made very clear so that there is no confusion over the components that are excluded from the Right to Repair bill. For example, everything that is on an export control list must be excluded from the Right to Repair provisions. Enforcing compliance with these regulations will not be trivial. I suspect that the laws encompassing the Right to Repair will have to be very strict and detailed, and that they will have to reference or include many other and secondary laws. As such, making the repair industry compliant could be the economic nail in the coffin for many devices that were initially targeted by the proposed Right to Repair laws.

    VPNs and TOR – a Security Assessment

    Posted on by: Axel Kloth

    As a result of my postings with regards to the supply chain attacks against SolarWinds and Kaseya I have been asked what I think of VPNs (Virtual Private Networks) to fend off some of the threats. We will analyze that.

    I think the term VPN has been misused lately, so I’d like to clear up what it is, and how the different types of VPNs compare. While we are at it, I’ll throw in TOR as well.

    VPNs have been around for a while, and they protect data-in-transit. The traditional understanding of VPNs is that they span a Virtual Private Network across the planet, and they use the Internet as a transport medium. A VPN will "tunnel" through the Internet. It does that by encrypting the traffic between the endpoints after having authenticated them. Companies started using VPNs as a means to connect remote workers to a central office. These types of VPNs worked well as they made the contents of the VPN that tunneled through the Internet invisible to any attacker. An attacker was only able to identify the two endpoints of the communication, but it was and is impossible to see the payload (the contents of the communication). The VPN endpoint usually was the employee's computer or a VPN-capable firewall at the employee's home office. It did not matter where on the planet the employee was, he or she could log into the company's network, computers and servers as if he or she were in the office.

    A very similar type of VPN was used between two or more branch offices of a company. That made it possible to transmit secret messages between branch offices and remote workers without the risk of someone snooping or inserting unauthorized contents. In those applications, firewalls were deployed that acted as VPN termination points. Each branch office had to use VPN firewalls to participate in the secure communication. They had to be set up properly with the encryption type, a pre-shared key, and the appropriate methods for authentication (for both the phase 1 and phase 2 negotiations).

    It is important to mention that both of these types of VPNs are end-to-end connections with end-to-end encryption. Typically they use a protocol called IPSec, which allows for a persistent connection. This is in contrast to SSL and TLS, which are transient connections and tunnels.

    To my knowledge, VPNs were never successfully breached for as long as they were set up properly. The problem with these VPNs was largely that they were difficult to set up, and that interoperability was not great. Cisco gear did not want to communicate with Juniper gear, and others did not fare much better either. The biggest challenge though was the implicit assumption that all endpoints were secured and free of malware. In essence, the VPN treated all endpoints as if they were local resources on the internal company LAN. That assumption was oftentimes proven wrong. Particularly on computers that were bought and administered by the employees ("BYOD", or Bring Your Own Device policies), the trust put into these devices was misplaced. Any malware and worm that was capable of spreading within a LAN was capable of traversing the VPN, and so VPN-attached endpoints had to be considered about as safe as any device in a DMZ or De-Militarized Zone. As a result, malware scanners in the firewalls became necessary for all VPN endpoints. That additional policing and filtering took away many benefits of the VPNs, and as a result, VPNs were abandoned by many companies.

    We would advocate for the continued use of VPNs if possible, particularly from a security and performance standpoint. As a result, we have implemented a smart NIC in our Server-on-a-Chip that executes all necessary functions for packet filtering, encryption and decryption as well as authentication independent of the application processor cores. To make it clear, VPNs and packet filtering alone will not stop infiltration or exfiltration of data caused by supply chain attacks as those are directed against the server infrastructure itself, but with VPNs and filtering and authentication as well as with code signing the likelihood of another supply chain attack against an IT infrastructure software provider can be minimized.

    The term VPN has been taken up by a new crop of companies promising better Internet security. Let's see if that holds up to scrutiny. Most Internet users are connected to the Internet through their Internet Service Provider, or ISP. This can be DSL if your carrier uses phone lines, cable if your ISP uses coax cable for TV and data, or glass fiber if your provider uses Fiber to the Home (FTTH) or Fiber to the Curb (FTTC) and a secondary fiber from the curbside unit to the home. In any of these cases, all of your traffic goes through your ISP. Your ISP can therefore monitor and observe all your traffic, including but not limited to your DNS (Domain Name System) traffic.

    The DNS infrastructure is a very important part of the Internet, and it resolves the symbolic name you type into your browser's address bar into an address that networking devices understand. While you type NYT.COM, your computer does not know what that is. Neither does your Internet access device, but it will ask the nearest or pre-configured DNS server what that is in IPv4 or IPv6 language. The DNS server will resolve this and report back 199.181.173.179, which is a public IPv4 address. Your computer's browser can now establish a connection with NYT.COM under that IP address. Since your ISP has access to all of this traffic, your ISP knows which web sites you visit. If some of them are less than clean, your ISP will know that.

    VPN services have understood that and offer a solution to this. If you sign up with them, they will send you credentials for a VPN endpoint setup, and if you configure your PC, wireless router or LAN router with these credentials, then the traffic between you and their ingress router will be encrypted. In other words, they obscure your traffic and the ISP cannot monitor it. However, since your VPN is terminated at the ingress side of the VPN service's router, and since your ISP cannot provide DNS services to you, the VPN provider will resolve all DNS queries, and then forward and route your traffic accordingly. In simple terms, your VPN provider now has the technical ability and in some jurisdictions the mandate to collect your traffic and your metadata (your IP address and the IP addresses of all of the web sites you visit). In reality, you have not gained any privacy or security. You have shifted the data collection from your ISP to your VPN provider.
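
    The name resolution step described above is easy to demonstrate: the sketch below asks the configured resolver (normally your ISP's or your VPN provider's DNS server) for the addresses behind a hostname, using the standard getaddrinfo() call. Whoever operates that resolver sees the query; the hostname and port are just example values.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <arpa/inet.h>

/* Resolve a hostname the same way a browser does before it can connect. */
int main(void)
{
    const char *host = "nyt.com";             /* example hostname */
    struct addrinfo hints, *res, *p;

    memset(&hints, 0, sizeof hints);
    hints.ai_family   = AF_UNSPEC;            /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host, "443", &hints, &res) != 0) {
        fprintf(stderr, "resolution failed\n");
        return 1;
    }
    for (p = res; p != NULL; p = p->ai_next) {
        char buf[INET6_ADDRSTRLEN];
        void *addr = (p->ai_family == AF_INET)
            ? (void *)&((struct sockaddr_in *)p->ai_addr)->sin_addr
            : (void *)&((struct sockaddr_in6 *)p->ai_addr)->sin6_addr;
        inet_ntop(p->ai_family, addr, buf, sizeof buf);
        printf("%s resolves to %s\n", host, buf);   /* the resolver saw this query */
    }
    freeaddrinfo(res);
    return 0;
}
```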

    This is very similar to TOR, short for The Onion Router. TOR was and is a special version of the corporate version of Firefox. TOR searches for other TOR users that have published their involvement and participation in a secret database, and it will then use a number of other TOR users as if they were a VPN service provider, while making sure that the traffic between you and the next TOR user is encrypted. Your traffic will in essence go through your ISP fully encrypted, and it will use other TOR users as entry and exit points into and out of the Internet. TOR actually uses multiple hierarchies of VPN tunnels so that TOR users and TOR exit points are not easily identifiable and traceable, but it is not 100% bulletproof, as an analysis of delays can render enough data on how many layers of encryption and how many TOR users are in the chain to ultimately crack it and pinpoint the true endpoint of the traffic.
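
    The layering that TOR applies can be sketched as follows. encrypt_for() is a hypothetical placeholder for an authenticated encryption step; the point is only to show that the payload is wrapped once per hop, so that every relay can peel off exactly one layer and only ever learns its predecessor and successor.

```c
#include <stddef.h>

#define MAX_HOPS 3

/* Hypothetical primitive: encrypt 'len' bytes in 'buf' (which must be large
 * enough for the added overhead) for the given hop's key, returning the new
 * length. Stands in for a real AEAD/public-key scheme. */
size_t encrypt_for(const char *hop_key, unsigned char *buf, size_t len);

/* Onion-wrap a message: encrypt for the exit hop first, then for each hop
 * closer to the sender. The innermost layer belongs to the exit hop, the
 * outermost layer to the first relay the sender contacts. */
size_t onion_wrap(const char *hop_keys[MAX_HOPS], unsigned char *buf, size_t len)
{
    for (int hop = MAX_HOPS - 1; hop >= 0; hop--)
        len = encrypt_for(hop_keys[hop], buf, len);
    return len;
}
```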

    TOR was originally developed by the CIA to allow whistleblowers all around the world to send secure messages to the CIA without exposing the source to anyone outside of the CIA. An analysis of the code of TOR revealed a whole lot of measures that were taken to protect the whistleblower, but also plenty of vulnerabilities of the concept, including a phone-home provision. As one would expect, the author of this used TOR after having removed the phone-home provision, and experimented with the browser and its concealment provisions. Today, TOR is mostly used to access the Dark Web. That is a theme for a different discussion and blog post.

    Yet another Ransomware Attack

    Posted on by: Axel Kloth

    It seems like there is no letting up on ransomware attacks. Axios reports that Kaseya hackers demand $70 million in massive ransomware attack. This appears to be yet another supply chain attack. A supply chain attack is an attack that is directed at a company that provides IT infrastructure management tools. A fraudulent piece of software is inserted into this management tool or toolset. Usually, this piece opens up a backdoor by which the attackers can access every user of the management toolset. Since the tool has administrative privilege, it can do anything it wants. In the case of ransomware attacks, the attackers encrypt the contents of the server(s) to a key pair of which only they know the private key. If the attacks are noticed in time (i.e., before the last valid and unencrypted backup is overwritten by an encrypted one), then all it takes is to disable the IT management tool and restore the most recent valid backup. In that case, only the data that was generated between the most recent valid backup and the time of the breach notification is lost. Restoring a backup also takes quite some time, so the business will be interrupted for a while. This is not a desirable situation to be in, but it is better than having to pay a ransom and hope that the criminals behind the attack release a valid decryption key. In my opinion, paying a ransom should be made illegal to avoid making ransomware attackers commercially successful.

    We also need to think about additional levels of administrative privilege. Today, there is the admin (or root) and the ordinary user. This needs to be amended. There need to be admins, and above them super-admins that administer certain rights, enable and disable admins, and install firmware updates and updates to the OS kernel. Admins should then be restricted to administering users and applications above the OS level. Users should only have the right to store and retrieve data and use applications installed by admins. Very limited privileges for code execution should be granted. Limited users should only be able to store and retrieve data, without any rights to initiate any code execution. In those scenarios, IT management tools only have admin privilege, and so bulk encryption cannot be initiated. For firmware and OS kernel updates and changes, the tools must require elevation to super-admin privilege levels, and that will require additional authentication. As an industry, we might also have to rethink hardware, firmware and OS kernel security. Ultimately, we might also have to rethink the current data-in-transit and data-at-rest paradigm.
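
    A sketch of the tiered privilege model proposed above could look like this. The tier names and the operation list are my own illustration, not an existing OS facility; the point is that bulk actions such as firmware or kernel updates require a higher tier than the one an IT management tool runs at.

```c
#include <stdbool.h>

/* Proposed privilege tiers, from most to least powerful. */
typedef enum {
    TIER_SUPER_ADMIN,   /* firmware and OS kernel updates, enables/disables admins */
    TIER_ADMIN,         /* users and applications above the OS level               */
    TIER_USER,          /* store/retrieve data, run installed applications         */
    TIER_LIMITED_USER   /* store/retrieve data only, no code execution             */
} tier_t;

typedef enum {
    OP_FIRMWARE_UPDATE, OP_KERNEL_UPDATE, OP_MANAGE_ADMINS,
    OP_INSTALL_APP, OP_MANAGE_USERS,
    OP_RUN_APP, OP_STORE_DATA
} op_t;

/* Minimum tier required for each operation. */
static tier_t required_tier(op_t op)
{
    switch (op) {
    case OP_FIRMWARE_UPDATE:
    case OP_KERNEL_UPDATE:
    case OP_MANAGE_ADMINS:  return TIER_SUPER_ADMIN;
    case OP_INSTALL_APP:
    case OP_MANAGE_USERS:   return TIER_ADMIN;
    case OP_RUN_APP:        return TIER_USER;
    default:                return TIER_LIMITED_USER;
    }
}

/* An IT management tool running at TIER_ADMIN cannot, for example, push a
 * firmware update without explicit re-authentication to TIER_SUPER_ADMIN. */
bool is_allowed(tier_t caller, op_t op)
{
    return caller <= required_tier(op);   /* lower enum value = more privilege */
}
```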

    Here is a scenario I’d like you to consider.

    Imagine that you need to transport large amounts of cash from A to B. Obviously, since cash has value, you’d like to protect the transport. So you build a transport van that protects the payload. The drivers are in an unprotected cabin. Does that make sense to you?

    Similarly, once you arrive at the bank, the vault is pretty safe, but the tellers are unprotected. Does that make sense to you?

    Well, that is exactly what we see in Internet security. There is a focus on data-at-rest and on data-in-transit. What’s not protected is the whole set of devices storing the data (even though the data may be on disks that are encrypted) and the whole set of devices that transport the data from A to B. In other words, Internet security focuses on protecting the payload. It does not make sure that the devices that transport and store the data are equally well safeguarded.

    We can see the impact of those decisions. Data gets stolen or "scraped", breaches occur, and data is lost in attacks that encrypt your data and hold it ransom. Would you not think it is time to protect the devices that store and transport your data just as well as the payload?

    That is what Abacus Semiconductor is doing. Besides being a high-performance solution with our Server-on-a-Chip and our intelligent memory subsystem, we protect the device that protects the data-in-transit and the data-at-rest. With those measures in place, we might be able to stem the tide of ransomware attacks.

    The USA turned 245 years old (or young)

    Posted on by: Axel Kloth

    This Fourth of July marks the 245th birthday of the United States. It gained its independence from the British in 1776. In five years, it will turn 250 years old. Not all democracies in the history of humankind have survived nearly a quarter of a millennium. Let's hope that the US democracy is going to be in a much stronger and better position then.

    Intel interested in SiFive

    Posted on by: Axel Kloth

    It looks like Intel has submitted a bid for SiFive, the commercialization entity for the RISC-V processor and ISA. According to Tom's Hardware and Bloomberg, Intel Offers $2 Billion for RISC-V Chip Startup SiFive.

    This would be a validation not only of RISC-V, but also of the ecosystem around it. I doubt that Intel would make that acquisition to shut down RISC-V as SiFive is not the RISC-V steering committee but merely the commercialization branch of RISC-V. This appears like a genuine approach of Intel to diversify into non-x86-64 processors. While SiFive has a focus on embedded systems and therefore can help Intel fend off ARM in those areas, I believe that this might signal the end of x86-64, and now puts pressure on AMD. This was a brilliant move, because as I have mentioned in my blog multiple times (FTC opens probe into nVidia and ARM merger, nVidia and ARM merger hits roadbumps, Apple and ARM and nVidia buying ARM), ARM will cease to be independent when the ARM acquisition by nVidia goes through. ARM will be under control of nVidia.

    That means that all ARM architecture licensees will feel the heat from nVidia. Why does it matter? ARM architecture licensees are multi-billion- and even multi-trillion-dollar companies:

    • Apple
    • Qualcomm
    • Broadcom
    • Samsung
    • NXP etc

    All of these companies will have to decide if they want to stick with nVidia/ARM or switch over to RISC-V.

    RISC-V is an open Instruction Set Architecture and can be implemented by anyone. If need be, anyone can buy RISC-V processor designs from SiFive and in the future from Intel - or from us. We have been using RISC-V since about 2012 and have experience with it.

    We are the only company that has developed extensions to RISC-V for performance, security and scalability, and we have added hardware support for virtualization.

    That makes us the only company that can get RISC-V into the server and internet backend.

    Hardware Vulnerabilities

    Posted on by: Axel Kloth

    Software design is not quite as robust and structured as hardware design. Nevertheless, even in ASIC and processor design bugs sometimes slip through the cracks, and they create problems that can be exploited. We have seen that with Spectre and Meltdown, both of which exploit an issue with out-of-order execution in conjunction with caching strategies. While Intel and AMD have issued patches, the problem is not fully solved, and users report performance degradation after applying the patches. Other hardware-related vulnerabilities such as Rowhammer and Half-Double make use of knowledge of the physical characteristics of DRAM, and I am fairly certain similar exploits can be devised for Flash. These exploits rely on side effects of mechanisms that were intended to improve processor and memory performance, and they were not foreseeable by the designers of the processors and memories. Nevertheless, they are devastating in their effect. I am certain that security will be one of the more important design considerations for the next generation of processors and memories. We have taken those vulnerabilities into account and have made sure that our processors and intelligent memory subsystems do not exhibit them.

    RISC-V in HPC

    Posted on by: Axel Kloth

    SiFive claims that it has taped out a 5 nm TSMC-produced HPC and AI capable processor in its press release here: SiFive RISC-V Proven in 5nm Silicon. In a lot of ways, that is great news as it proves that RISC-V is fully capable of supporting HPC. However, as I have said many times before, the ISA does not matter. Being successful at running HPC workloads comes down to how accelerators are included, and how memory is attached. My prior blog post ISA versus System Architecture points out what the fallacies are. We believe that there is much more to HPC than just a CPU made on a 5 nm TSMC process.

    Communication, metadata and endpoint security

    Posted on by: Axel Kloth

    Vice reports that Hacking Startup 'Azimuth Security' Unlocked the San Bernardino iPhone. That's not too surprising, as the prior suspect did not seem to have the wherewithal to do so. According to the article, "Motherboard can confirm a Washington Post report that said Azimuth Security developed the tool used on the San Bernardino iPhone".

    Let me quote some more from the article because I think that particularly the last paragraph is of importance: "Shortly after the FBI successfully accessed the phone, rumours circulated, originating with a single Israeli press report, that established phone-cracking company Cellebrite was behind the hack. Those reports were unsubstantiated, though. After unlocking the device, the FBI found no previously unknown message data or contacts."

    Why is that important? The investigators had all of the information they ever needed without cracking the phone or without having access to the encrypted communication (and neither did they need a secret backdoor). All they needed was the metadata - and I had stated that many times. There is no reason whatsoever to try to outlaw strong encryption. See my older post on this here: DOJ on Encryption.

    Yet another supply chain attack and breach

    Posted on by: Axel Kloth

    Gizmodo reports that U.S. Federal Investigators Are Reportedly Looking Into Codecov Security Breach, Undetected for Months.

    This is another supply chain attack, similar to the one used in what is called the SolarWinds breach. In this case, the attackers appear to have been able to gain access to a system that allows users to upload software to be tested onto a test server, and while that does not sound dangerous, the attackers likely were able to either extract user credentials directly, or added a backdoor that would send them the credentials of anyone who logged in.

    If that is the case, then the number of affected users is greater than the 29000 users mentioned. This is in fact dangerous because the impact balloons. It might not seem obvious, but as with the SolarWinds hack, the direct damage is bad, but the indirect damage is worse. Let's say that a Microsoft super admin's credentials were obtained in the SolarWinds hack, and the admin changed credentials. He or she wanted to make sure that not only new credentials were used, but that the software was updated as well - including MS Exchange and its cloud equivalent. So far the admin did everything right - change credentials and make sure that known vulnerabilities in the software are fixed. Now this admin submits the new code with the new credentials to be tested, only to find out later that someone stole his or her new credentials during the upload for 3rd party verification that the bugs were fixed... This is about as bad as it gets.

    For the SolarWinds attack itself, there are plenty of updates and news, such as NPR summarizing it and updating it with some new info, and of course the Biden administration imposing sanctions on the perpetrators of the hack in the following Bloomberg article Biden Sanctions Russia, Restricts Buying New Debt After Hacking, while at the same time trying to de-escalate the situation in Biden calls for de-escalation with Russia following sanctions, proposes meeting with Putin.

    Neocortex Supercomputer

    Posted on by: Axel Kloth

    It looks like there is some new life in semiconductor companies and in funding for them. It was about time as I fundamentally disagree with the notion I have heard at least one too many times that we have invented all that there is to invent. Current CPUs and all GPGPUs combined do not solve many computationally intensive problems effectively and efficiently. Certainly Cerebras made a splash here, and that is a great sign. The Next Platform reports that national labs are working on the Neocortex Supercomputer to Put Cerebras CS-1 to the Test.

    I completely agree that novel solutions to novel computational problems must be found, designed and funded. AI and ML training certainly qualify, but there are plenty of other unsolved problems that do not require wafer-scale compute. While Cerebras and we have different approaches to different problems, I am encouraged to see that funding and the willingness to try out new solutions are available again.

    For way too long have analysts and VCs focused on CPUs and GPUs only. It seems that CPUs have settled on x86-64, ARM and RISC-V, and within the GPGPU category nVidia is leading the pack due to CUDA, but DSPs and vision processors for industrial and automotive real-time control applications, security processors and math processors as well as large-scale integer-only database processors are needed to fill the gaps that CPUs and GPGPUs cannot fill.

    ARM v9, a new ARM processor architecture

    Posted on by: Axel Kloth

    ARM has released a new version of its processors and instruction sets. They hope that with this processor and ISA they can compete better and more effectively with Intel, AMD and all others in the Internet backend and in the data center. Whether that ISA or processor actually can achieve what some publications claim remains to be seen, as we can read here on CNET, stating Arm's new chip architecture boosts security, speed for billions of processors or here at Bloomberg with the headline Arm Takes Aim at Intel Chips in Biggest Tech Overhaul in Decade.

    As far as I can tell from the announcements and the available technical literature, ARM does what Intel and AMD have been doing, and there is really no difference in the approach: improving the number of instructions executed per cycle and adding instructions to the ISA. Both measures increase the complexity of the cores, add to the size and power consumption, and require ever-more sophisticated caching strategies along with a required increase in cache size. I wonder if that approach is the right one to take. Particularly in light of more processor and OS modes, the attack surface will only grow with an increase in the permitted instructions of the ISA.

    The undead are returning once again

    Posted on by: Axel Kloth

    It looks like the IT world has to deal again with the longest-living undead that I can think of, and that is the entire UNIX copyright saga around AT&T and SCO/Novell and now Xinuos - what a creative name. An anagram of UNIX appended by the abbreviation of Operating System as a company name. Clever. Not so clever is the fact that the Fear, Uncertainty and Doubt (FUD) has re-emerged, and threats of lawsuits against UNIX users are back. ZDNet has a great overview of this story here: SCO Linux FUD returns from the dead.

    The Register has an equally good overview of the background and some of its implications here: IBM, Red Hat face copyright, antitrust lawsuit from SCO Group successor Xinuos. I wonder when this is finally settled in court or through the passage of time. I am getting pretty sick of this kind of use of the legal system. Certainly patents and other IP should be protected, and the prevailing party should have the rights to the proceeds, but patent trolls and re-litigating cases that have already been decided or settled only clog up the judicial system for the legitimate cases.

    A great overview of DMA

    Posted on by: Axel Kloth

    Well, now here is a blast from the past, and it is even incredibly well-written: DMA. That is one of the many battles most processor and SoC designers face. There are two camps, one usually consisting of the software designers. They want to do everything in software, from polling instead of IRQs to memcopy and transferring data to and from peripherals. The hardware camp is usually of the opinion that everything should be done in hardware, so they advocate for IRQs and all data transfer for I/O executed by a DMA Controller (or its more advanced cousin, the IOMMU). After seeing that IRQs and traps and exception handlers are all now treated the same, I am firmly in the camp of hardware support. Therefore, the article came at just the right time. It takes less energy to have I/O (and even memory-to-memory) transfers executed by a DMA Controller. It is faster, it can be made more secure, and it is more reliable. It has very few, if any, downsides. The complexity of a multi-channel DMA Controller or even an IOMMU is so low that including it will not substantially increase the die size of a processor or SoC or even a microcontroller.
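
    For readers who have not worked with one, the contract between software and a DMA Controller is small. The descriptor layout below is a generic illustration (register names and fields vary per device, and dma_kick() is a placeholder): the CPU fills in the descriptor, kicks off the transfer and goes back to useful work until the completion IRQ arrives.

```c
#include <stdint.h>

/* Generic, illustrative DMA descriptor - real controllers define their own layout. */
struct dma_descriptor {
    uint64_t src_addr;      /* physical/bus source address          */
    uint64_t dst_addr;      /* physical/bus destination address     */
    uint32_t length;        /* transfer length in bytes             */
    uint32_t flags;         /* e.g. IRQ-on-completion, chaining     */
    uint64_t next;          /* pointer to the next descriptor, or 0 */
};

#define DMA_FLAG_IRQ_ON_DONE  (1u << 0)

/* Hypothetical controller interface: hand the descriptor to the hardware and
 * return immediately; the controller raises an IRQ when it is done. */
void dma_kick(volatile struct dma_descriptor *desc);

void start_copy(volatile struct dma_descriptor *desc,
                uint64_t src, uint64_t dst, uint32_t len)
{
    desc->src_addr = src;
    desc->dst_addr = dst;
    desc->length   = len;
    desc->flags    = DMA_FLAG_IRQ_ON_DONE;
    desc->next     = 0;
    dma_kick(desc);         /* the CPU is free to do other work from here on */
}
```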

    We have been designing and using DMA Controllers and IOMMUs for a long time as there is simply no substitute for simplicity, performance, security and power savings while at the same time allowing concurrent operation between the DMA Controller/IOMMU and the CPU.

    Cybercrime

    Posted on by: Axel Kloth

    As chip designers and as IT admins, as users and as ordinary citizens, we have probably all intuitively sensed that cybercrime is on an upward trajectory. The US CISA has statistics on the number of cybercrimes committed, and even looking at the number of CVEs filed on MITRE.org or the overview compiled at SANS.org, it is pretty clear that cybercrime is not going away, but instead growing. CNET compiled the numbers, and they are staggering: Cybercrime in the US jumped by 55% in the past two years. The annual loss reached $4.2 billion in 2020.

    In other words, cybercrime has reached a magnitude that requires all of us to take it seriously, and more importantly, will require us to take action. Security - and particularly cybersecurity - is not a spectator sport. We all have to do something proactively to protect ourselves. Using a secure browser on a secure operating system, with passwords that are long enough and are not reused across many sites, is a good start. Then contemplate using a hardware firewall - maybe there is one already built into your wireless access point or your cable or DSL modem. Learn how to configure and use them. Practice some digital hygiene. Not everything has to be posted on Facebook or Instagram or whatever other social media there is.

    North Korean Hackers at it again

    Posted on by: Axel Kloth

    ArsTechnica reports that hackers that were exposed by Google and Microsoft have reacted with classical counter-warfare by targeting those that exposed them. ArsTechnica states that North Korean hackers return, target infosec researchers in new operation. That is not too surprising, but in essence this is quite an escalation of severity. However, not only did they escalate, but they also set up a fake web site purporting to be a security company with a keen focus on research for digital security.

    CNN reports that Russian hackers did the same to those that exposed them in connection with the SolarWinds breach. According to CNN, Hunting the hunters: How Russian hackers targeted US cyber first responders in SolarWinds breach, Russian hackers directly targeted these researchers. Again, a very classic counter-warfare operation.

    That means that even more so than before, everyone - including security researchers - has to be very careful when sharing zero-day attacks, vulnerabilities, novel exploits or any kind of security holes, even in procedures and processes. At this point in time it seems as if the Western democracies do not fight back hard enough. While the US has offensive cyber warfare capabilities, it seems like the US is more on the receiving end of cyber warfare. Unfortunately, there is no global legal framework in place that is agreed upon and can be used to determine who the perpetrator was, and what the penalties are. On top of that, if the cyber warfare program is a state-run effort, it is hard to believe that the states that conducted these attacks will agree to any kind of extradition in any treaties.

    Firmware attacks are on the rise

    Posted on by: Axel Kloth

    I have been warning for a while that firmware and the boot process are not as secure as they could have been made. Firmware updates can be executed unauthenticated, and with physical access to a device, modification of the firmware is borderline trivial. All of that poses a danger, and I have mentioned it multiple times. In each of my blog posts about Pre-Boot Security, Security in the News, Newsmax rehashing debunked stories and Resilient Secure Boot, I have pointed out that firmware attacks are the most malicious ones as they cannot be detected from a running OS or by any malware scanner.

    On top of that, firmware controls the basic behavior of the system itself and of all of its embedded components. As far as I know, there is no malware scanner for the embedded controller in hard disks, SSDs, RAID Controllers, SAS or SATA Controllers, or LAN network interface cards, smart NICs and wireless LAN (WLAN/WiFi) Controllers - and the keyboard, the mouse and touchpad/trackpad. Even the GPU has its own firmware, and I have yet to see a scanner for malware on GPUs. As a result, I completely agree with Microsoft's assessment that Firmware attacks are on the rise and you aren't worrying about them enough.

    Micron giving up on 3D XP

    Posted on by: Axel Kloth

    There are more reports on 3D XP, and more in-depth information becomes available on why Micron decided to halt development of 3D XPoint and sell its Flash Memory Fab. Apparently, there was not enough demand to fully load even one fab. That is not entirely consistent with a prior statement saying that Micron was unable to scale out production. The only way both statements can be true at the same time is if scaling out production was so expensive or technically challenging that Micron gave Intel an upper limit of what could be produced for its Optane memory, and even if Intel absorbed all of it, it was not enough to fully utilize the Utah fab for this product. That would be a truly sad ending for a once promising technology. In my earlier blog post about 3D XPoint Memory Discontinued I assumed that Micron was in fact unable to scale out production of 3D XP and as a result was unable to make this technology self-sustaining. The lack of demand surprised me.

    In essence, it means that everyone needs to go back to the drawing board for a successor to Flash. While Flash has improved drastically, the fact that we need to deal with wear is somewhat problematic, but the biggest downfall is the write performance. In essence, the industry is still stuck with the growing performance discrepancy between processors, DRAM and Flash. I had hoped that 3D XP, or more generally Phase Change Memory, would have taken over from Flash as the leading memory type for density. Had that been the case, we could have expected DRAM manufacturers to focus more on performance instead of density and to create a lower-latency DRAM main memory. Unfortunately, that did not happen.

    Pre-Boot Security Gets More Secure

    Posted on by: Axel Kloth

    Booting up a computer has not changed in over 30 years. While the TPM and a few other improvements were invented along the way, essentially any computer today boots the same way a 30 or 40 year old machine did. The only fundamental difference is that way back then, computers booted out of a ROM, an EPROM or an EEPROM, which was essentially unchangeable via software. While that made it impossible for attackers to insert malicious code into the boot (or pre-boot) environment, it also prevented easy updates to the firmware to fix bugs. Over time, the pre-boot environment became more important as the boot code itself grew in size. With more and more of the firmware migrating to EFI/UEFI, the pre-boot environment has taken over what used to be the initial startup, and it is unprotected. I had designed a novel pre-boot environment with a primary focus on security and resilience, and I am glad to see that this is seen as a pressing issue today. Proof of that is that an increasing number of companies focus on pre-boot security, such as Lattice Semiconductor. While we are doing our part, we need the industry to come up with solutions that work universally, and that can be standardized and evaluated for their attained level of security and resilience.

    Once that is achieved, the industry will need to develop more stringent tests for security and resilience in pre-boot and boot environments, and that will have to include both hardware and firmware to withstand and log all evidence of attacks.

    Attack vectors will have to include those that assume physical access to or possession of the device under attack. After all, with intelligence pushed out to the edge like for 5G, we will see many more devices at the edge in physically not very secure enclosures, and that will make these devices much more vulnerable to snooping, denial of service attacks and theft with the intention to steal keys.

    3D XPoint Memory Discontinued

    Posted on by: Axel Kloth

    Years ago, Intel and Micron had set out to develop a new kind of memory. It had to be faster than Flash and denser than DRAM. It was also supposed to not show wear and degradation. They called it 3D XPoint ("CrossPoint"), or 3D XP for short. It was Phase Change Memory, but if you called it that, Intel got upset. I thought that the approach was genuinely good, novel and worthwhile, but Intel lost interest a while ago, so Micron took it all over. Unfortunately, Micron could not create a large-scale, production-worthy version of it, and as a result, it was canceled altogether. That's really a loss as I think that it could have been the missing link between Flash and DRAM. While I had assumed for a while that it would not see commercial use, the announcement at The Next Platform came as a surprise, and it now necessitates that the industry rethink what the new memory type between DRAM and Flash should be with regard to density, latency and bandwidth.

    I would have liked to see some new form of memory in that gap, but it seems that we may have to revert to what Abacus Semi is doing and what its HRAM will show in terms of performance, density and of course the price point.

    SolarWinds breach detection tool

    Posted on by: Axel Kloth

    You know that things are bad when CISA releases a new SolarWinds malicious activity detection tool. While it is great that the community now has one more tool in its toolchest to detect if a breach occurred, it means that most CIOs and CISOs don't even know if they have been breached, and if so, to which degree data was exfiltrated. That is a complete and utter disaster.

    The more we learn about the SolarWinds breach, the more it becomes clear that its impact has been vastly underestimated. I think it may have a lasting effect for all software vendors, and that includes open source and closed source. As far as I can tell, it might also undermine trust in cloud services and online tools such as Office365 and similar or equivalent solutions.

    Our unique selling point

    Posted on by: Axel Kloth

    Unleashing the true potential of massively parallel compute solutions is a system architecture decision, not a result of using better process nodes. We do not have to rely on Moore's Law.

    We are not banking on Moore's Law alone to improve computer and processor performance. Intel, AMD and all ARM licensees can do that just fine, and that is not our major point of criticism of the status quo. Any processor designer can probably improve on the performance of any given processor - with or without Moore's Law. What we are saying is that ganging up many processors does not work well enough to provide linear scaling of performance across a broad range of computational problems. In other words, 10,000 cores in a Linux cluster will not solve a computational problem 10,000 times faster than one core. Why is that? Very simply, processor design has been too successful over the past 20 years. The discrepancy between processors, accelerators and memory has only grown in that time. That now demands a fairly coarse granularity of tasks, and that is counterproductive. What do we mean by that? If a processor core works on a task that would take it 10,000 CPU cycles to complete, and we have 100 cores available, then distributing the task onto the 100 cores should allow all of the cores to complete the entire task in 100 cycles as we have parallelized the problem. So ideally, we have 100 times the core count and 1/100th of the execution time. That would be ideal linear scaling. The problem is that it does not work that way.

    First of all, we face the problem that we need to chop the big problem into 100 chunks. Then we need to verify that there are no dependencies among the chunks. The first part will likely take 100 cycles all by itself, on one core. The second part can be parallelized, but it adds to the total complexity of the problem. So let's say we ignore the verification of interdependencies and let the cores figure it out once they work on the problem. We still have to distribute the chunks. That will take a fixed but fairly large amount of time as we have to share the data and instructions across cores in a cluster within a CPU, then across CPUs, and then across computers to other CPUs and cores. This will consume tens to thousands of cycles, and we have not worked on the parallelized problem yet - only on chopping up the task and distributing it. Once everything is fully distributed, the cores work on the problem for about 100 of their cycles, and then send the results back. Now the initial scheduling core will take in the results and consolidate them. It is in the realm of possibilities that all of this overhead will cause the result to be ready after about 10,000 cycles on the CPU core that tried to farm out the problem to accelerate processing. In other words, we have not saved any time. Instead, we have kept many more cores busy and used the CPU-internal resources, the network and a whole lot of other infrastructure without gaining anything. The reason for this is simply that cores process data much faster today than they did 20 years ago, but memory has not sped up to the same degree, and while network bandwidth increased dramatically, the latency (the delay) has become worse. As a result, it does not make sense to farm out small tasks of 10,000 or 100,000 cycles. The granularity of tasks that can and should be farmed out has gone up drastically, and as a result, it has become very coarse. Fine granularity only makes sense across adjacent cores like the ones in a ccNUMA system, or across cores in a coherency domain.
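    To make the cycle arithmetic above a bit more tangible, here is a minimal back-of-the-envelope model in Python. All numbers are the hypothetical ones from this post plus made-up per-chunk overheads, not measurements of any real system.

    ```python
    # Back-of-the-envelope model of farming a 10,000-cycle task out to 100 cores.
    # All cycle counts are illustrative and mirror the hypothetical example in the text.
    def parallel_cycles(total_work=10_000, cores=100,
                        split_cost=100,               # chopping the problem into chunks (serial)
                        distribute_cost_per_chunk=50, # sending data and instructions to a core
                        collect_cost_per_chunk=50):   # gathering and consolidating the results
        overhead = split_cost + cores * (distribute_cost_per_chunk + collect_cost_per_chunk)
        compute = total_work / cores                  # ideal compute time per core
        return overhead + compute

    ideal = 10_000 / 100                              # 100 cycles if scaling were perfectly linear
    actual = parallel_cycles()
    print(f"ideal: {ideal:.0f} cycles, with overhead: {actual:.0f} cycles")
    # With these made-up overheads the "parallel" run takes about 10,200 cycles,
    # i.e. no faster than running the whole task on the single original core.
    ```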

    What needs to happen is that core processing performance, inter-processor communication and memory bandwidth have to be re-balanced to allow for reasonably fine-grained granularity of tasks to be parallelized. That is what we are doing, and that is where our unique selling point lies. If your computational problem can be solved in one core or one processor, we will not be the solution for you. If your computational problem is big enough to keep millions of cores busy, we are the right partner.

    A real-world Spectre exploit

    Posted on by: Axel Kloth

    It was only a matter of time. Spectre and Meltdown are vulnerabilities that so far had only shown the risk of theoretical exploits, but none in the wild. That has now changed. The Record claims that the First Fully Weaponized Spectre Exploit Was Discovered Online. In essence, that means that a non-fixable hardware feature has been exploited by real-life malware. With that attack now being a blueprint for others, more malicious attacks will be carried out against processors for which no hardware fix is possible, and the firmware fix not only reduces performance, it also does not guarantee that the system is no longer vulnerable.

    That is about the worst case scenario imaginable. Protecting against an entire class of attacks is going to be necessary, but how that is going to be done is unclear to me. I am not sure if firewall rules can even be written to detect such attacks.

    SIMD and why I don't like it

    Posted on by: Axel Kloth

    I have criticized SIMD ("Single Instruction, Multiple Data") architectures for quite a while, and for one reason or another I got a lot of questions and feedback on my take on it in the past few weeks. I'll try an analogy to point out what SIMD is, why it is not always useful, and why MIMD ("Multiple Instruction, Multiple Data") is not a whole lot better. EPIC (Explicitly Parallel Instruction Computing) is dead, and approaches with VLIW (Very Long Instruction Word) and extremely wide instruction words and decoders/predecoders don't seem to hold much promise either.

    Imagine you have a very long wall that needs to be painted. You have 100 painters available to speed up finishing the job. A foreman will go and create 100 sections of identical size, and then direct each of the 100 painters into a certain position. Upon job start, each of the 100 painters will do exactly the same thing - down to the movement of the paint roller. If the wall were flawless and the surface required exactly the same amount of paint for proper coverage all along the wall, the result would be a perfectly painted wall that was finished 100 times faster than if a single painter had done it.

    If the wall surface is not perfect, it is usually not a big problem for the foreman or a few painters to fix the remaining issues, and if the wall had a few doors and windows and trim, then these doors and the trim can be repainted with the proper paint after the wall paint has cured. The windows have to be treated such that the paint is scraped off as all painters will have painted over the windows as well.

    The more windows there are, and the more the wall surface is imperfect, the more repair work will have to be carried out, and the advantage of using only one foreman and 100 painters doing exactly the same thing goes away.
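    For readers who prefer code over paint: in a SIMD (or GPU warp) model, a data-dependent branch is typically handled by computing both code paths for every element and selecting afterwards, so the divergent work is simply thrown away. A minimal NumPy sketch with made-up wall data:

    ```python
    import numpy as np

    # SIMD-style processing of a "wall": every lane executes the same instructions.
    # A data-dependent branch (window vs. plain wall) is handled by computing BOTH
    # code paths for all elements and selecting afterwards - the unused work is wasted.
    wall = np.random.rand(1_000_000)       # made-up surface data, one value per segment
    is_window = wall > 0.95                # roughly 5% of the segments are "windows"

    plain_pass = wall * 2.0 + 1.0          # work intended for ordinary wall segments
    window_pass = np.sqrt(wall) - 0.5      # different work intended for window segments

    # Both arrays above were computed in full, even though each segment needs only one of them.
    # The larger the fraction of "windows", the more of the 100x advantage evaporates.
    result = np.where(is_window, window_pass, plain_pass)
    ```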

    A seemingly simple solution would be to assign one foreman to each painter so that the foreman can tell the painter at any given time what to do. In actuality, that's not a solution: we might still retain the time advantage of 100 painters and 100 foremen finishing the job 100 times faster than a single painter, but we have now deployed 200 people, and there needs to be a clearinghouse or arbitrator among the large number of foremen, so that will add to the total number of people on the job. Just the job of distributing tasks and jobs is now an administrative challenge by itself, so we have created a hierarchy of non-productive entities (the foremen that don't paint) that comes with increased cost and complexity, but we have not fundamentally solved the problem. In essence, that's MIMD. Instead of the task complexity itself we now have a challenge in the task distribution complexity.

    What we do is different. We are not using SIMD or MIMD. It is not EPIC either, nor does it rely on an extremely wide instruction decoder either. We cannot disclose what it is exactly, but it does not suffer from the SIMD or MIMD problems, nor does it stall when dependencies are discovered which is what bogs down VLIW architectures.

    New observations & questions

    Posted on by: Axel Kloth

    I think it is time to add one more set of observations to my older ones, which can be found here and here. I'll call this new set of observations Kloth's Third Observation.

    • 1. Most computational problems today are large enough to not be solvable in a single core or processor, assuming reasonable execution runtimes are required.
    • 2. As a result, programmers must be enabled to spread out the computational load over many cores easily.
    • 3. Ideally, the performance of a many-core system or massively parallel system should grow with the number of cores linearly.
    • 4. On-chip communication is lower latency, higher bandwidth and energetically more efficient than off-chip communication.
    • 5. Certain types of computational problems are not well solvable in a general-purpose processor as they either cause long runtimes or high power consumption or both, and as a result, coprocessors or accelerators for these kinds of computational problems are required.
    • 6. Coprocessors as accelerators solve specific problems and are faster and more energy-efficient than general-purpose processors and thus should be deployed to support general-purpose processors.
    • 7. #1 through #6 require that we put as many small cores as we can onto a single chip, and allow for massive I/O bandwidth to other processors, accelerators and memory.
    • 8. To achieve #7, we should build the largest processor and coprocessor feasible that we can fit onto a die economically.
    • 9. Ideally, processors and accelerators should have the same pinout to be interchangeable.
    • 10. They should use the same I/O to be able to communicate with each other and with memory.
    • 11. All coprocessors should be easily usable by programmers through the use of APIs (“libraries”) to prevent the need to re-invent the wheel many times over.

    With these observations in mind, we can now start to think about what we need to do to solve those problems.

    Security in the News

    Posted on by: Axel Kloth

    Computer and network security finally seem to get the attention they deserve. The past few years have been absolutely terrible when it comes to protecting everyone's personal and private data. Equifax was not the first breach, and the SolarWinds hack certainly won't be the last. The issue I see is that the approach to solving the problem does not seem to fit the underlying cause. The solution that is pushed by the industry is more software. If software were able to solve the problem, we would have solved it. However, what we observe is the opposite: more software equals a vastly larger attack surface. Therefore, more and better hardware is needed so that the attack surface is reduced, and then better firmware and better software can be built on top of it.

    Let's first get to the mentions of the solution providers in The 20 Coolest Endpoint And Managed Security Companies Of 2021: The Security 100. Again, I am glad security is finally being brought up, and that is a good thing. If you look at the list, it is all higher-level software, without any regard to the underlying Operating System, the firmware or the hardware. The good thing is that there is backup and recovery software in case the breach did succeed...

    The next article is about support software that helps prevent permanent damage in case a company got breached. Prevention of breaches by using encryption, virtualization, logging, filtering and advanced analytics as well as heuristics will reduce the attack surface and to some degree the severity of a breach. These applications are all useful and needed, and they certainly have their place, but by themselves they won't solve the breach problem in the first place. The 20 Coolest Network Security Companies Of 2021: The Security 100 is a good read and the list is well-researched; however, I still think it lacks pointers to the real solution - better underlying hardware that makes the firmware more robust and needs less software to protect itself, therefore reducing the attack surface of the whole solution. This must start with a secure BMC for servers, secure processor cores and non-core components, hardware filtering, truly secure boot and firmware update hardware measures, and of course accelerators integrated into the CPU to facilitate these requirements. Only when that is accomplished can we try to harden the software (and that includes the Operating Systems as well as all APIs and the application software stack) against attacks.

    Newsmax rehashing debunked stories

    Posted on by: Axel Kloth

    Newsmax is rehashing two stories that were debunked long ago, claiming that China Used Secret Microchip to Spy on US Computers. The first one is an old Bloomberg story that claims that China (all 1.3 billion of them?) installed secret microchips on SuperMicro server mainboards so that they can control them. The second story is about an alleged Pentagon attack in which China had infiltrated an unclassified part of a Pentagon network.

    I had commented on that Bloomberg story on Youtube while I was CTO of Axiado. First of all, server mainboards are designed with modern CAD software that creates a BOM (Bill of Materials), and second, the pick-and-place machines used in mainboard assembly use that BOM and the coordinates from the CAD software to build the boards, and then quality control (QC) will take pictures of each mainboard and compare the actually built boards with the "golden model" of what they should look like from the CAD design files. Any extraneous chips would immediately be noticed. It is highly unlikely that a company as sophisticated as SuperMicro would not notice an extraneous chip on their mainboards. It is even more unlikely that first SuperMicro misses them, and then users such as Apple, Amazon and others, including us, would not notice any chips they don't recognize on server boards. Another claim in this story is that the extraneous chip would be able to control and monitor all of the traffic inside the server. Well, if that is the case then this little magic thing must have more computational horsepower than all of the Intel or AMD or ARM cores that are legitimately present in the server - and all of that without its own memory! If someone is advanced enough to build such a chip, then they would not waste their talent on an attack as indicated above.

    What is possible is that an attacker installed malicious firmware on the server mainboard, as the Baseboard Management Controller (BMC) - in essence a Linux/ARM-based PC inside a PC-based server - is not a very secure chip. However, that is firmware, not a chip. Huge difference... The outcome may be similar, as putting a backdoor in or monitoring some of the management traffic is certainly possible, but it is for certain not an extraneous chip.

    The other story, that an unclassified Pentagon network was breached and unusual behavior of SuperMicro servers was observed, has not been reported as true anywhere. This is a fairly wild claim that cannot be substantiated. Penetrating even an unclassified Pentagon network is not trivial. Assuming that compromised servers were installed and then created unusual traffic is much more likely, albeit I doubt that the Pentagon allows servers that were not subject to configuration management (and that includes firmware versions) to be installed in any live network. I doubt the validity of that story as well. All Pentagon networks will have firewalls on each entry and exit point that check for infiltration and exfiltration, so if the attackers hoped that classified data would be exfiltrated from a breached non-classified network, then this was a test to check the Pentagon's response to an exfiltration threat.

    I am not trivializing the threat, but there are simply too many holes in these stories. A sophisticated attacker would use stealth methods - not an extraneous chip, and not waking up the 800-pound gorilla with an unnecessary test and the ensuing incident response. On top of that I have to say that the attacks out of China have been a whole lot less advanced than those we saw from Russia.

    FTC opens probe into nVidia and ARM merger

    Posted on by: Axel Kloth

    It looks as if, about four months after the nVidia/ARM merger announcement, some of the larger ARM technology licensees are finally waking up, and the FTC is taking a cue from Britain's Competition and Markets Authority that this merger or acquisition is probably not going to be without an impact on licensees or end customers.

    The FTC is now taking a more detailed look into the nVidia/ARM merger, which in essence is not a merger but a flat-out acquisition of ARM. My concern was and is that a FRAND type licensing agreement is not a very likely outcome of that acquisition, and that it would harm all current ARM licensees. The question here is not if the current licensees could do what Apple did with its ARM-architecture application processors. That is the easier part as virtualizing them and then gradually replacing them is not too much of an issue as Apple has shown multiple times now.

    The much bigger problem is all the small and oftentimes ignored embedded ARM cores in peripherals and in most of the I/O, including WLAN and USB and other components. They are designed and built to run on as little memory as possible and therefore cannot be virtualized. As a result, nVidia knows it has more than a bargaining chip. It can exert undue financial pressure on licensees that simply cannot replace the embedded ARM processor cores in time to avert an increase in licensing fees. I would not be surprised if any modern phone contains 30 to 40 embedded ARM cores - removing and replacing them is a nightmare, not only for the hardware development teams but in particular for the firmware teams, as these cores likely run a very broad variety of operating systems or even bare-metal application code.

    I assume that Google/Alphabet, Microsoft, Qualcomm and others have come to the same conclusion, according to Bloomberg and CNBC. Looking at the licensing terms and conditions for CUDA - nVidia's signature piece of software that makes GPGPUs hum - could have provided regulators and licensees a clue towards what nVidia is up to. I am amazed that it took them so long to find and voice their objections.

    Let's see how this pans out. I am sure that the last word has not been spoken in this saga.

    nVidia and ARM merger hits roadbumps

    Posted on by: Axel Kloth

    I had voiced my concerns on the nVidia/ARM tie-up on 2020-10-12 (nVidia buying ARM) and on 2020-10-14 about Apple and ARM on my blog.

    It looks like Qualcomm is finally deciding that maybe I was right, and is now objecting to the acquisition. That's a bit late, and I assume that it might not have an impact on the regulators, but the big wildcard in the game is China anyway. If China objects or obstructs the deal, then it does not make a difference whether Qualcomm objected.

    My argument was and still is that ARM is a licensing company for a widely used processor architecture and as such has to be independent of any of its licensees. An independent entity can make sure that there are no favorites, no undue influence, and that licensing can follow a FRAND (fair, reasonable and non-discriminatory) model. This would not be possible under nVidia ownership - and neither would it be under the ownership of Apple, Samsung, Qualcomm, Broadcom or any of the other big names. That is not to say that these are bad or unethical companies - the opposite is true. These are all good companies, but they are driven by (relatively) short-term profit requirements as they are all public companies, and as such have to have a leg up over their competitors. That in turn means that the primary driver is an increase in licensing revenue, and while that might destroy the ARM ecosystem in the long term, it will bring monetary benefits in the short term.

    I believe that ARM would do better if it remained independent. If that is not an option, then I suggest that all ARM licensees form a consortium with strong governing principles that continues the current licensing model, and make sure that this consortium is owned by all licensees that spend in excess of $10M annually on ARM licenses. That way license fees to some extent return to the owners while at the same time making sure that the ecosystem persists, and that new customers of the ARM architecture can be onboarded.

    I say this not because I am a fan of the ARM architecture. The opposite is true. I vastly prefer the open RISC-V ISA over ARM and x86-64 and other processor architectures, but for the sake of diversity I'd like to see ARM stick around.

    Resilient Secure Boot

    Posted on by: Axel Kloth

    I have been asked many times now why I do not think that the so-called secure boot is in fact secure. It is not exactly easy to understand why, even with some newly added security technology, current secure boot schemes do not guarantee full firmware boot image integrity. Once the bootloader has loaded all necessary components and has started loading the operating system from disk, we’d like to be certain that no malicious code has been installed or is even running below the level that the operating system (and therefore the malware scanners) can detect.

    Here is how “secure boot” technology works. The CPU will start executing code from a very specific memory location once RESET is de-asserted. This location is typically located within the so-called real mode address range that is only accessible prior to entering virtual mode, and it is associated with a physical address that is located at the top of a 20-bit or 32-bit addressable address space. An SPI Flash chip is normally mapped via memory-mapped I/O to this address. Since the processor does not know how to read from a device that is not main linear memory, it needs microcode ROM inside the processor to be able to access, read and execute code within the SPI Flash. This microcode cannot be altered and is created during the manufacturing and final test of the processor. It instructs the processor to read and execute the code within the SPI Flash, and as a first step, the processor “verifies” the integrity of the entire SPI Flash chip. It reads the entire contents, called the firmware “image”, computes a hash value, and compares that hash value to a value found at a specific location in the image itself, or to a value that is stored in the Trusted Platform Module (TPM).

    While this sounds like a secure solution, there is nothing that stops anyone from altering the verification code in the SPI Flash to simply skip the verification and declare the contents valid, complete and authentic. In other words, the verification is a bit like Münchhausen pulling himself out of the mud by his own hair. If you trust the verification that is run in software from an unverifiable source, then yes, this is secure. If, however, you assume that the verification code can be circumvented and entirely skipped by a programmer who “updates” your firmware image in the SPI Flash, then this scheme does not seem capable of guaranteeing that the image being booted has never been compromised.

    For the sake of the netizens' digital security on the Web I certainly hope that resilient secure boot becomes the industry standard. We call it Assured Firmware Integrity/AFI™ and Resilient Secure Boot/RSB™ because we believe that the term secure boot is overused and at this point in time useless. Most secure boot schemes are in fact not secure. We can assure the integrity of the boot code, and we can guarantee that the firmware has not been modified by an unauthorized entity or with unauthorized firmware images. At the same time, we can guarantee resilience as we can boot from a golden image in case an update failed or was corrupted.
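    To make the Münchhausen problem concrete, here is a minimal Python sketch of the kind of check described above. The image layout (a SHA-256 digest stored in the last 32 bytes) is an assumption for illustration; the point is not the hash math but where the checking code lives.

    ```python
    import hashlib

    # Minimal sketch of the firmware "verification" described above (illustrative layout:
    # the image is assumed to carry its expected SHA-256 digest in its last 32 bytes).
    def verify_image(image: bytes) -> bool:
        payload, stored_digest = image[:-32], image[-32:]
        return hashlib.sha256(payload).digest() == stored_digest

    # The weakness: verify_image() itself ships in the same mutable SPI Flash. Anyone who
    # can rewrite the flash can replace it with a function that always returns True, or
    # simply recompute and store a matching digest for the tampered payload. Only a
    # verifier rooted in immutable hardware, checking a cryptographic signature rather
    # than a self-declared hash, breaks that circle.
    ```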

    Pat Gelsinger (finally) in as CEO at Intel

    Posted on by: Axel Kloth

    Pat Gelsinger was the first CTO at Intel and the architect behind the i486, and he should have become CEO when Paul Otellini stepped down. Instead, a string of non-technical CEOs was installed, and essentially all of them failed. While Intel's BoD finally came to its senses, the question now is whether it is too late, and whether Pat can catch up after more than a decade away from chip and process node development.

    Let's hope he still can identify tech talent so that he can refresh the current second and third level of management. It is direly needed. Intel needs to catch up both in processor architecture and in process technology. That's a huge undertaking.

    Most of Intel's expensive acquisitions (more than $33B) under the non-technical CEOs have not returned any investment, and it is going to be up to Pat to reverse those mistakes, fire those that came on board without the required technical understanding, and restore competitiveness with AMD and ARM (and to some degree RISC-V) on the processor architecture side while at the same time battling TSMC for process node development. This is going to be an uphill battle, and it is made even more delicate as Intel needs TSMC to produce some of its processors. Drawing the lines between being a customer and competing with a supplier is going to be tough.

    I wish Pat all the best simply because if he is not successful, Intel will suffer, and semiconductor manufacturing in the US will vanish.

    Our pledge for actions

    Posted on by: Axel Kloth

    We pledge that we will not do business with anyone who has been actively involved in the attempted coup and insurrection in Washington, DC on 2021-01-06, including the organizers and enablers of President Trump, as well as those Senators and Representatives who objected to the certification of the votes of the Electoral College, nor with any company that has appointed any such person to a Board of Directors or into an advisory position. We will prepare a list of persons that can be obtained by emailing us. These people and organizations will be permanently banned from doing any business with us.

    For any job applicants we will require that they certify that they have not been part of the insurrection in Washington, DC on 2021-01-06.

    Maybe the Senators and Representatives that serve the public have forgotten what the Constitution looks like, what it is all about, and what legal repercussions it has if it is disobeyed. In case they don’t recall that there is a government printing office through which the US Constitution is freely available, here is a link to the US Constitution in a version from 1993. Another link to a different annotated and commented version with analysis of its legal impact and ramifications is here.

    We have changed our T&Cs

    Posted on by: Axel Kloth

    In light of the attempted coup by the sitting President of the United States of America to overthrow the freely and fairly elected Biden/Harris government, we have decided to update our terms of engagement and our code of conduct with prospective employees, customers and partners.

    We will update this section asap.

    Donald J. Trump - a clear and present danger

    Posted on by: Axel Kloth

    Donald J. Trump has proven to be the clear and present danger to the US, its Constitution and its people that many of us had feared. It is my civic duty to request that our elected officials act and remove him from office and impeach him so that he cannot start a second coup attempt, start a war, or bring upon us even more civic unrest than he has already incited. The insurrection he incited and asked for brought us to the brink of disaster, and we cannot afford to have him serve out his term as it simply is too dangerous to do so. He has no checks and balances in place and still holds the nuclear codes and the "Commander in Chief" designation.

    HPE and Oracle leaving Silicon Valley

    Posted on by: Axel Kloth

    Recent news indicates that Oracle and HPE have left Silicon Valley. That's not entirely true, as all these companies did was move their HQs to Texas. As far as I can tell, there were no layoffs or office closures within HPE or Oracle. I see that purely as a tax avoidance strategy. I do not foresee any impact on the workforce once the pandemic is over and employees return to the office for work. It is troubling in a way that clearly those companies did not see continued value in being headquartered in Silicon Valley, but I do not think that one should infer that Silicon Valley has lost its mojo. It is expensive here, and it always has been, but the pool of talent is deep and broad. While other cities (and regions) may have caught up in some areas, I still think that the entrepreneurial spirit and availability of funding sources combined with the depth and breadth of talent in the San Francisco Bay Area is unparalleled.

    The departure of Palantir is different though. While I am not a huge fan of Peter Thiel or Alex Karp I have to admit that there is some truth in why Thiel, Karp and Palantir all left Silicon Valley. To some degree Silicon Valley has become its own echo chamber, and there is a downside to that. If we agree that a democracy has to have the right to defend itself (i.e. the nation and its people) against adversaries, then we must be able and allowed to put that defense infrastructure in place, without a social or other stigma attached to those who work there. If lawmakers and justices agree that there is a threat from foreign and domestic groups that threaten to overthrow our way of governing ourselves, then of course we must have the means in place to conduct surveillance operations against those involved. That does not invalidate my opinion on strong encryption, as I have mentioned many times that surveillance of the endpoints and the existence of communication can be sufficient to establish a reasonable cause for a search warrant. Wholesale surveillance and archival of communication is not and should not be synonymous with this goal, and that is why I am split on Palantir. The NSA might have developed the technology anyway, but Palantir certainly accelerated the possibility of complete and total surveillance. Coming back to my echo chamber comments, I think that while it is great to have a noble goal (“Do no evil” comes to mind), sometimes that has unintended consequences. I’ll expand on this in a subsequent blog entry.

    It has become very expensive to live and conduct business in Silicon Valley, and there cannot be any doubt that this has started to drive away talent and a lot of the supporting populace that makes the Valley hum. It is questionable whether we get our money’s worth here if I look at the highways, the public transportation infrastructure, and even Internet access at a decent Quality of Service and cost. For unmanaged 1 Gigabit per second links without any decent QoS and availability (and robustness during disasters) the going rate is over $100 per month, and that is too high in international comparison. Managed full-duplex GbE is still in excess of $1000 per month, and that is vastly overpriced. It is ridiculous to see that in the middle of the region that brought us all of the ASICs and the devices that make communication available, this very access is not competitive with the rest of the world.

    Attack on the Internet infrastructure in 2020

    Posted on by: Axel Kloth

    I want to comment on the so-called SolarWinds hack as there is a lot of misinformation out there, and it is not due to the fact that reporters intentionally mislead, but due to the complexity of the case. I would like to simplify it as much as possible without making it false to explain what happened here.

    For one reason or another, penetrating a single server, a single data center or a single cloud provider was not enough for the attackers. They wanted full control over the IT infrastructure in their target countries.

    This was either impossible to do with traditional methods, or too time-consuming.

    To understand this attack and its scope, it is important to know that data centers have grown so huge that they require their own set of tools to just deploy servers, prepare them for customer (or in cloud speak, "tenant") use, administer them, check on their health status, and maintain them as well as to take them down if they need replacement. These OAM&P (Operation, Administration, Maintenance and Provisioning) tools - in this case Orion - are complex all by themselves, and they are usually built the same way that all other Internet backbone tools including Linux are built: with a collection of a very large number of pieces of source code, some of them proprietary and in-house, and others open source.

    That is where this attack originated. The attackers hijacked some of these pieces of software and added code to them that constituted a so-called "backdoor" once their code was included in a new "build" of the OAM&P software. In other words, once the new tool was built (compiled and linked), their code allowed them to use the backdoor to access the servers through the OAM&P mechanism. That is very hard to detect as the tool does what it was intended to do, but for people who were not supposed to access it. The only way to detect an attack like this is by using behavioral monitoring. In other words, if I use the tools to conduct OAM&P and that is my profile of use, and then someone with illegitimate credentials uses the same mechanism (IP addresses, port numbers, APIs and more) to conduct surveillance and to access net user data, then the profile is different and should (and can) be detected as illegitimate and be flagged as a breach.
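    As a toy illustration of behavioral monitoring, here is a short Python sketch in which the profile is simply the set of (source address, port, API endpoint) combinations seen during normal administrative use. A real product would use far richer statistical models; every name and address below is made up.

    ```python
    # Toy behavioral monitor for an OAM&P interface (illustrative only).
    class BehaviorMonitor:
        def __init__(self):
            self.known_profile = set()

        def learn(self, source_ip: str, port: int, api: str) -> None:
            """Record legitimate administrative behavior during a baselining period."""
            self.known_profile.add((source_ip, port, api))

        def is_expected(self, source_ip: str, port: int, api: str) -> bool:
            """True if the access matches the learned profile; False means flag it for review."""
            return (source_ip, port, api) in self.known_profile

    monitor = BehaviorMonitor()
    monitor.learn("10.0.0.5", 8443, "/api/v1/provision")                 # normal OAM&P use
    print(monitor.is_expected("10.0.0.5", 8443, "/api/v1/provision"))    # True  - fits the profile
    print(monitor.is_expected("203.0.113.7", 8443, "/api/v1/export"))    # False - deviates, flag it
    ```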

    However, it took SolarWinds and FireEye as well as Microsoft more than six months to detect this breach, and effectively end new infections (but likely not eradicate all existing backdoors). I rarely praise Microsoft, but in this case they did exceptional work once this breach was suspected. Microsoft then followed through with stopping the attack and helping affected users to preserve evidence as much as possible to submit this to law enforcement. FireEye - while itself being a victim - assisted Microsoft and others in evidence collection and analysis of the logs of victim machines.

    The challenge that all victim organizations are facing now is that it is unclear if the attackers managed to install backdoors in the servers themselves, not only in the OAM&P servers in the network management centers (NMCs). The number of servers in the NMCs is limited, and worst case they can be physically replaced. However, it is not clear if the breach went so far as to compromise the servers that deal with customer or tenant data. In other words, at this point in time we cannot trust in the integrity of the servers in the affected organizations - and those include the US Treasury and the administrators of the US stockpile of nuclear weapons.

    Only after a thorough review of all software on the servers that might have been affected can we be certain that no additional backdoors were installed. That is a monumental task as likely hundreds of thousands of servers must now be assumed to be compromised.

    The question to ask is what the attackers got out of the successful execution of this breach. First of all, they gained insight into lots of US agencies, and they have been able to steal data possibly indicating the identity of spies and counteroperations. Second, they have succeeded in eroding trust in US and, more generally, Western democratic institutions. Third, they have damaged the economy as some of the Internet infrastructure has to be replaced or at least evaluated for persisting breaches and backdoors. The damage is likely in the tens, if not hundreds, of billions of dollars.

    Considering that the targets are very clearly the Western democracies, I have to say that the attacks likely originated in Russia or in North Korea. While China certainly has its own cybersecurity and cyberattack units, this breach is very unlikely to come from China as its economic well-being is too intertwined with the economic status of the Western democracies.

    More on M1 and multi-threading

    Posted on by: Axel Kloth

    More and more benchmarks keep appearing for Apple's M1. The more I read about it, the more I am impressed by how far Apple pushed the ARM architecture, both in terms of absolute performance and with regards to its energy-efficiency. This is not only due to a manufacturing process at TSMC that is ahead of everyone else, it is also about how the entire processor is effectively a System-on-Chip with all needed accelerators fully integrated. What I am still missing is information about how many of the accelerators are in fact fully programmable and by themselves are ARM cores in disguise. I assume it's quite a few of them, and that may be a good portion of its advantage. Surely the CPU cores are impressive, with the little and big ones implemented such that the required performance is achieved by the cores that are most suited to the tasks. While they might still lack the multi-threaded performance that large Intel cores provide, the real advantage lies in the offload of mundane tasks, and I'd really like to see how much of the offload is done in dedicated cores so that the application cores don't have to take over those chores. I am pretty sure that in another two generations or so Apple will have caught up with Intel even in multi-threaded performance, while retaining its lead over Intel with the intelligent I/O. That would truly end Intel's dominance in compute. The only question now is what AMD does to hold up the x86-64 flag... I am not betting on them to keep it going. I think it is time for AMD to switch over to RISC-V and leave x86-64 behind.

    Performance, Benchmarks and Apple's M1

    Posted on by: Axel Kloth

    With regards to Apple's announcement of the new M1 we are back to discussing performance. Performance is measured by using standardized benchmarks. Even that is a very complex and multi-faceted issue for processors and even more so for computers. Why? Because the number of input parameters is extremely high. If we go back to basics and discuss basic physical units we want to measure, that's pretty simple: if we want to measure a physical unit, we can refer to the methods that NIST and BIPM and others prescribe. For a length, we take the speed of light and measure the time it takes to go from the starting point of the unit under test to the endpoint of it. Since we know the speed of light (it's one of the fundamental constants), the travel time gives us the exact length at a very high level of precision. The same is similarly applicable to areas and volumes, and even to currents and voltages. The definition of the kilogram has recently been changed from the physical Ur-Kilogramm prototype to one derived from a fundamental constant (the Planck constant).

    Measuring the performance of a processor or a computer is a vastly more complex undertaking.

    The perceived performance of a processor or computer is highly dependent on my application profile, and that is different from yours and anyone else's.

    I do not game. I do compile, but I do more floating-point math work. I don't use Word or any MS Office tool. I use LibreOffice, and I use a different browser with vastly higher capabilities for filtering and a different email client than most. So my use and therefore my requirements profile differ. Maybe you compile, do lookups, use a standard browser and FTP and other tools. Your use of the instructions that your computer (more precisely, the processor's ISA) provides you is different from mine. Your need to access memory likely is different from mine. Your applications may have a different locality of the data going to and coming from memory. Locality has a huge impact on performance as caching algorithms make use of locality to hide the inherent latency of DRAM over SRAM. The locality of data and instructions (or the lack thereof) may have a bigger impact on performance than the processor core.

    Let's say I run 1 MB data sets, and yours are 100 MB. Yours will exceed the cache size of any CPU and will work only if there is locality in your data. If not, your caching will not work effectively since incoming data will overwrite existing data only to be reloaded again. Your performance will be at the performance level of the DRAM. Even if your and my setup and software are identical, I will see much better performance out of the same HW.
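    A quick way to see this effect for yourself is to time the same operation on a working set that fits into the last-level cache versus one that does not. A rough NumPy sketch, with the caveat that the exact numbers depend entirely on the machine:

    ```python
    import time
    import numpy as np

    def time_per_element(n_bytes: int, repeats: int = 50) -> float:
        """Average time per element for repeatedly summing an array of n_bytes."""
        data = np.ones(n_bytes // 8, dtype=np.float64)
        start = time.perf_counter()
        for _ in range(repeats):
            data.sum()
        return (time.perf_counter() - start) / (repeats * data.size)

    small = time_per_element(1 * 1024 * 1024)    # ~1 MB: fits in most last-level caches
    large = time_per_element(100 * 1024 * 1024)  # ~100 MB: has to stream from DRAM
    print(f"1 MB working set:   {small * 1e9:.2f} ns per element")
    print(f"100 MB working set: {large * 1e9:.2f} ns per element")
    # On typical hardware the large working set is noticeably slower per element even
    # though the code is identical - the difference is purely the memory hierarchy.
    ```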

    For me, it does not matter if integer performance for CPU X is better than for CPU Y, and I also use more tools that parallelize compute. So a processor with more lower-performance cores but better floating-point performance is better for me, particularly if performance scales with the number of cores. The same is true for Inter-Processor Communication (IPC). If my applications and my compiler allow for better use of multiple cores and even multiple processors because my OS distributes the load better than yours, while yours keeps things on very few cores and only uses inter-process communication between threads and tasks, you will see vastly different performance based on the differences in how the processors and cores deal with IPC.

    Your applications may also differ in how much and how often they need to execute I/O functions.

    If my computer needs to execute all I/O in software on the main CPU cores, and yours can offload those functions, yours will outperform mine by a wide margin on all applications that have or create a high I/O load.

    How can a single benchmark capture this? It cannot.

    That is why there are different benchmarks for different things, and that is why it is so hard to come up with a single fair benchmark.

    This is today compounded by the problem that we do not only measure absolute performance, but also the performance per Watt, or the computational efficiency. To sum it up, there is no single fair and correct benchmark in existence today.

    Apple's approach to showing comparisons for typical applications used by MacBook owners is not more or less valid than others; it is one of many that all may render different outcomes. It is not any more or less fair than others either.

    Bitrot revisited

    Posted on by: Axel Kloth

    Yes, you read right. Bitrot. Bits don't rot, you say, and you are right. They don't. The physical media they are on deteriorate. But that is not what I mean here.

    Bits represent data, and that data is used to represent something, usually some sort of content in some format for some software. If the data is still there but it cannot reasonably be interpreted any more, it might as well have rotted away - and that is what I mean. Case in point: I needed to look up some data from my Diploma thesis, written in 1989 and luckily available in four formats: Microsoft Word from way back then, LaTeX, a printout, and a GIF scan of that printout.

    The printout on acid-free paper was still perfectly fine since I had hidden it and it was never exposed to sunlight. No oily fingers leaving prints were allowed to touch it, and so the printout was fine. The scan from way back then as a GIF was importable and readable too, but the prevalent resolution then was fairly low, so the clarity left much to be desired. It was so bad in fact that I considered it rotten. Today's Microsoft Word did not even touch the old Word file - it simply could not open or import it, even with all of the file format conversion tools that MS has available. It did not open it, it did not display it, and it did not print it. I considered that rotten too, although all the bits were still perfectly preserved. By chance, I opened it in OpenOffice, and OpenOffice opened it, displayed it, and was able to print it, albeit with the loss of pretty much all formatting, and the embedded graphs were lost. All of them. So I still considered the Word format a total loss - complete bitrot. LaTeX fared a lot better. All text, formatting and formulas were still there. Most of the embedded graphs could still be displayed, with the exception of a few plots created by an IEEE488-attached plotter (yes, back then that was modern...). I considered that the winner for a while, only to find out that if I scanned the printout and then ran some decent OCR on it, I was able to recover close to 100% of all content: text, graphs, formulas and nearly everything else.

    So what's the moral of the story? Do not believe in the promise of a major piece of software creating something that can still be interpreted 30 years down the road. It will create something that is susceptible to major bitrot. If it is important, paper and printouts still rule. Even if you can preserve the bits and the file itself, it is going to be borderline useless if the application that was used to create the file is not available any more (or does not run on modern hardware). In other words, if you create documents that must survive for more than a century, and if they are important, print them. Maybe PDF/A might be an alternative, but I am not sure about that...

    Why do I bring it up now? Very simply because I see that there is a major change in Operating Systems, applications, and of course the type of devices that we use to read, write, browse and process data. Applications and flat-hierarchy, proprietary formatted, ASCII-type data formats from the old single-user, single-tasking Operating System days on clunky PCs with a 640 * 480 screen resolution have long given way to rich data formats based on or resembling HTML, on multi-user, multi-tasking Operating Systems with at least quad VGA screen resolution even on smart phones.

    Apple's M1

    Posted on by: Axel Kloth

    Apple announced its new Macbook Air, the Mac mini and the Macbook Pro yesterday. All three are powered by a big.little combo of ARM processor cores, a set of accelerators, and very high bandwidth memory interfaces. This is Apple's first internal design of a laptop processor, and it is way beyond the smart phone processors of the A line. The new processor dubbed M1 is seriously impressive, both in absolute performance and in performance per Watt. I had not expected such a performance leap over the incumbent Intel processors. While Apple did not quite disclose which benchmarks they used, the notion of a 3.5 times performance gain is impressive. Even more so if in fact the power consumption is reduced by 50% compared to the prior Intel processor.

    ISA versus system architecture

    Posted on by: Axel Kloth

    Over the past 40 years we have seen processor families and ISAs (Instruction Set Architectures) come and go. Not always did the "better" solution win.

    That was fairly apparent with Intel's original 8086 versus the Motorola 68000 family, then again with the 80286 against the 68020 and 68030, and it happened with the SPARC and MIPS versus the DEC Alpha. We saw the RISC versus CISC wars, and luckily all of those wars have died off.

    It is possible to design an excellent CISC processor the same way it is possible to create a subpar RISC design - and vice versa. The instruction sets do not really matter. All that counts is how many instructions per second the final product executes, how much energy it uses to do so, and what silicon area it consumes. Another important metric is how much the processor design burdens the software and compiler developers, as Intel found out with its Itanium and EPIC (Explicitly Parallel Instruction Computing). Most processor architectures have by now faded away, and we are left with x86-64, ARM in its various guises, and of course RISC-V. SPARC and POWER have seen a very dramatic decline in use, and while open source variants exist, they don't seem to gain much traction. MIPS has been undead for a long time now, but its new Chinese parents may want to (or have to) revive it.

    As a result, I believe that the ISA and CPU architecture wars are over and behind us, because very simply, neither the ISA nor the CPU (core) architecture matter. What matters is how easy it is to integrate accelerators and coprocessors, as these units will do the majority of the work. The CPU will become just an orchestrator. That is why I have high hopes for RISC-V (although even that ISA has issues that drive me up the walls).

    A good portion of the tasks in a computer can and should be dealt with by offload engines or accelerators. Intel claims that its processors are fast enough to take over all computational tasks in a computer, and that no accelerators or offload engines are needed. In all fairness, Intel tried twice to decentralize compute. The first attempt was the 8086/8087/8089 combination, where the 8088 or 8086 was the main CPU, the 8087 was the math coprocessor, and up to two 8089s could be included for offloading I/O tasks. The problem was that there was no operating system making use of this, and the idea died. The second attempt started better with I2O, and this time the I/O offload engine of choice for each I/O-intensive peripheral was the i960 and later on the (ex-DEC) StrongARM. Intel made an effort to add software and drivers, but that effort ultimately also failed. For ARM, the story was the opposite. The ARM core simply could not cope with any I/O, and it required offload, so even smaller and cheaper ARM cores were integrated into everything and anything that needed a performance boost due to demanding I/O, a complex protocol, or offload for a variety of other reasons. These became the "embedded" processor cores, and they used a variety of different Real Time Operating Systems (RTOSes). That is how ARM-based laptops today came to compete performance-wise with x86-64 based laptops.

    Apple and ARM

    Posted on by: Axel Kloth

    Over the past decades Apple has taken over most of its suppliers and most verticals providing services to it. It is hard to imagine that Apple will not do that for the most important piece of IP powering their products, the CPU core.

    Apple’s legal battle with Qualcomm, which Qualcomm won for reasons incomprehensible to industry insiders, is a showcase for Apple’s need to be in control of all of its supply chain. Qualcomm not only sold its wireless WAN chips to Apple, it also required Apple to pay royalties for that exact same IP included in those ASICs. For non-technology readers, an equivalent would be if someone bought a house and then had to pay usage fees to the prior owner of the house. That is hard to comprehend.

    Apple first stood up to Qualcomm by suing Qualcomm for essentially double-dipping. Apple won in the first instance, and many in the industry believed that Apple won rightfully. However, Qualcomm appealed, and on appeal it won. Most technologists and IP lawyers would probably agree that the US appeals court made a grave mistake here. In any case, Apple decided to double down, bought the former Infineon wireless WAN group from Intel and redoubled its efforts to build a 5G modem and bring this technology in-house. For the time being, though, Apple had to revert to Qualcomm and use its technology and ASICs for its iPhones. At a market capitalization of $2T I doubt that Apple will continue to use Qualcomm chips in the future.

    This illustrates to what lengths Apple will go to take control over its suppliers. That is why I am so surprised and confused that Apple did not buy ARM. With nVidia owning ARM, this story may repeat itself. Apple could have easily bought ARM. Certainly Google and Samsung would then have moved to an alternative, as would most other smart phone suppliers, but the transition to another CPU architecture would have been upon them, and not upon Apple. I do not understand this. I had bet that Apple - prior to the sale of ARM - would transition over to RISC-V, but I lost that bet. Apple did not consolidate ARM and x86-64 to RISC-V; it stuck with ARM. Now what does Apple do?

    nVidia buying ARM

    Posted on by: Axel Kloth

    Here is my perspective (and of course I may be wrong; I have been wrong on this in the past). nVidia acquiring ARM has direct, short-term as well as long-term implications for the entire semiconductor industry and for all related industries. I am going to highlight them and point out why I think that way.

    After full integration of ARM into nVidia there are going to be issues with the licensing model.

    First, in my opinion ARM is overvalued today if one applies usual economic guidelines. ARM's licensing revenue does not support a $34B or $40B valuation. I'd peg ARM's valuation at best at around $15B (seven times annual revenue), considering annual revenues of under $2B.

    nVidia therefore did not buy ARM for future revenue attained by licensing out ARM architecture cores. Paying $40B for an asset that is worth at best $15B if future revenue is taken into account is stupid. Jensen Huang is not stupid. He is brilliant. As a result, we can assume that Huang bought ARM not because of future licensing revenue.

    He bought it because he will have an asset that can be used to put pressure on Qualcomm, Samsung and - most importantly - Apple. Huang saw that Apple caved in the licensing and WWAN radio (both 4G and 5G) spat with Qualcomm. He'll do the same to Apple.

    Apple's entire iPhone and iPad product line is predicated on ARM. The announcement was just made that Apple's laptops will follow, and one could reasonably expect that Apple will discontinue all desktop and server products. In other words, Apple is an ARM house, and one with a very large in-house ASIC and processor development department. Apple likes to have all of its suppliers under very tight control, or to absorb them. Current Apple products contain many ARM cores, probably dozens, in a wide variety of functions inside all of their chips. Consolidating all hardware on ARM and all software onto one RTOS and one OS - presumably iOS at this point in time - will help Apple save money.

    I had predicted Apple would go the RISC-V route. I was wrong. Apple went ARM. Now, Apple really does not have a choice any more: either transition again, this time to RISC-V, or be under the control of Huang's nVidia.

    Many of you will disagree, and I have heard the arguments: nVidia will continue to play nice and license out ARM cores. That might be the case for a while, to lure everyone in. However, does anyone think that Apple will now buy pre-packaged ARM cores designed by nVidia? What will Samsung do? Or Qualcomm? They won't, because pre-packaged cores leave no room for distinguishing features. An Apple or a Samsung phone that is merely on par with an HTC phone, an arbitrary phone from a Chinese supplier, or any other phone? That is not going to happen.

    Second, most users do not care about absolute performance, and even if they do, they see it as a trade-off against power consumption. While today's modern in-order CPUs all achieve roughly the same number of instructions per second, the picture changes once instructions per second are normalized against power (or once out-of-order processing or simultaneous multi-threading enter the picture). On top of that, accelerators have proven to be useful and more efficient than CPUs, and currently nVidia's proprietary CUDA GPGPUs are the weapon of choice for AI and HPC acceleration.
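
    As a quick illustration, consider the following sketch with purely made-up, hypothetical numbers (they do not describe any real product): two cores that retire a similar number of instructions per second look very different once the figure is normalized against power draw.

        /* Hypothetical numbers only - a minimal sketch of perf-per-Watt normalization. */
        #include <stdio.h>

        int main(void) {
            double ips_a = 20e9, watts_a = 5.0;   /* assumed in-order mobile-class core      */
            double ips_b = 24e9, watts_b = 45.0;  /* assumed out-of-order desktop-class core */

            printf("Core A: %.0f GIPS, %.2f GIPS/W\n", ips_a / 1e9, ips_a / (1e9 * watts_a));
            printf("Core B: %.0f GIPS, %.2f GIPS/W\n", ips_b / 1e9, ips_b / (1e9 * watts_b));
            return 0;
        }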

    If nVidia fully enters the automotive L4-and-up fray, with the ARM cores working in conjunction with nVidia GPGPUs for display and AI acceleration, it will dominate that industry very quickly. nVidia can now tackle automotive, the phone industry, and the entire HPC market all by itself, with ARM and GPGPUs as well as x86-64 working with the GPGPUs. It does not need partners or other suppliers. In other words, ARM can be everywhere there is a need for compute. ARM covers the entire spectrum from feature phones to smart phones, from tablets to laptops, and it is present in servers as well as in HPC and of course in automotive and in all embedded systems. The only exception would be desktops, but those are going away and provide a woefully low margin. Hard to believe, but x86-64 might soon be the cheap solution, and not the default (performance) choice.

    All of nVidia's customers will become its competitors. And in this case, there is no friendly "coopetition".

    I am not sure why Tim Cook did not see this, and why he did not buy ARM. His exit paths out of the conundrum now are worse. Switching to RISC-V immediately risks a lot of business and creates a trust issue for Apple. Staying with ARM exposes Apple to risk from nVidia.

    nVidia as well as Google and Western Digital were the largest RISC-V proponents. Nothing at nVidia happens without Jensen Huang's approval. So Huang knew well that RISC-V is a threat to ARM. Thus, I can only surmise that nVidia bought ARM for exactly one reason.

    However, this has other implications as well. This brings me to the third argument, and that is the number of processor core choices available. What most casual users fail to understand is that the supporting ecosystem of a processor core is directly related to the number of users. The more users there are, the larger and more robust the ecosystem is, and the larger and more robust the ecosystem is, the easier it is to design and deploy products around that core.

    If ARM cannot be used by Chinese or other players because ARM now falls under CFIUS jurisdiction, alternatives must be used. There are not many:

  • POWER and PowerPC have been and still are on a steep decline, and no one knows what is going to happen after IBM's spinoff of its infrastructure play
  • MIPS has gone to China and thus can be used there, but its ecosystem is dwindling
  • Sun's (or now Oracle's) SPARC simply does not measure up, and the versions that could be used in servers are not open source
  • RISC-V

    Out of all of those, only RISC-V has a growing ecosystem, has no export restrictions, has the performance needed to compete with ARM and x86-64 in every aspect, and is open source (well, the Instruction Set Architecture is, not any given implementation).

    In summary, I believe that the real winner out of all of this will be nVidia as they can make money short-term from existing ARM licensees, but long-term, RISC-V will win as x86-64 is relegated to desktops without a margin, and to legacy server backends.

    US DoJ on Encryption – again

    Posted on by: Axel Kloth

    Yet again the US Department of Justice (DoJ) tries to pitch End-To-End Encryption against Public Safety. The reality is that the opposite is true: there is no Public Safety without End-To-End Encryption. Predictably, the DoJ brings up the exploitation of children to justify restricting the use of encryption. Encryption relies on secret keys or key pairs; the algorithms themselves are standardized. For backdoors to work, a repository of keys and key pairs has to be created. This database would be the most-targeted piece of property ever, as it would reveal all keys from everyone to everyone else using encrypted communication. Whether this database is a collection of databases held by each provider or a centrally and federally managed database does not make a difference. It will be breached. I do not want to go into any more detail here, and anyone who wants to dive deeper is invited to ping me. I promise to reply to email requests. I'd like to make it very clear: backdoors to encryption are not needed and are dangerous. This renewed attempt to push through legislation that restricts encryption must be stopped.
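
    To make the technical point concrete: the algorithm is public and standardized, and all of the secrecy lives in the key. The sketch below is a minimal example using OpenSSL's EVP interface with AES-256-GCM (error handling omitted for brevity); the only thing in this entire program worth stealing - or worth escrowing in a backdoor database - is the key.

        /* Minimal sketch: standardized AES-256-GCM via OpenSSL's EVP API.
         * The cipher is public; the 32-byte key is the only secret.       */
        #include <stdio.h>
        #include <string.h>
        #include <openssl/evp.h>
        #include <openssl/rand.h>

        int main(void) {
            unsigned char key[32], iv[12], tag[16], ct[128];
            const unsigned char msg[] = "End-to-end encrypted message";
            int len = 0, ct_len = 0;

            RAND_bytes(key, sizeof key);   /* the only secret in the whole scheme */
            RAND_bytes(iv, sizeof iv);

            EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
            EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv);
            EVP_EncryptUpdate(ctx, ct, &len, msg, (int)strlen((const char *)msg));
            ct_len = len;
            EVP_EncryptFinal_ex(ctx, ct + len, &len);
            ct_len += len;
            EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag);
            EVP_CIPHER_CTX_free(ctx);

            printf("ciphertext bytes: %d\n", ct_len);
            return 0;
        }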

    AMD buying Xilinx

    Posted on by: Axel Kloth

    It looks as if AMD is making a bid for Xilinx. This might be copying what Intel did with Altera, and that outcome was not particularly good. It might also be based on better insight at AMD, or a different vision of AMD's corporate leaders for its future. I am still scratching my head, as I cannot figure out why AMD would want to buy Xilinx. Xilinx is a great company, but I fail to see the synergies. AMD is a great company in its own right as well, but other than investing the valuation the stock market has given AMD over the past 12 to 18 months, I do not see where Xilinx would augment AMD.

    Let's check the individual strengths of each company. AMD has over the past few years caught up to Intel with regards to the x86-64 processor architecture, and arguably passed it. This is not only because Intel is behind in semiconductor manufacturing technologies versus TSMC, which makes AMD's processors. The more important aspect is that AMD's architecture appears to be superior to Intel's. That's quite a feat. Its other product line, GPUs, is lagging behind nVidia's. AMD can't seem to catch nVidia in total performance, nor in performance per Watt. nVidia has also monopolized GPGPU-accelerated compute by using a captive and proprietary standard, CUDA. AMD (ex ATI) does not have a CUDA competitor, nor is any meaningful open-source alternative (such as OpenCL) well supported by AMD. nVidia uses its advantage to the maximum possible extent, and it is now venturing into the automotive market and into AI and ML (Machine Learning) with its CUDA-supported products. AMD/ATI has nothing to counter that. Adding Xilinx to the mix does not change the equation one iota. Xilinx has excellent FPGAs, and its software to program the FPGAs (ISE and Vivado) is ahead of Intel/Altera's offerings, but it does not compare to CUDA. Xilinx FPGAs also don't offer a direct socket-level interface to any AMD server processor family that would increase interprocessor bandwidth and reduce latency between the CPU and the FPGA accelerator. Both would be needed to guarantee a performance advantage over PCIe-attached GPGPUs.

    nVidia, on the other hand, is buying ARM. Whether that means that nVidia is going to drop RISC-V remains to be seen, but if one assumes that accelerated compute is the future, then nVidia is much better positioned, as GPGPU compute is synonymous with nVidia and CUDA. In other words, if solutions three years from now are made up of 80% GPGPU, 10% CPU and another 10% of miscellaneous other logic on the die, then nVidia will take the lead and kill off x86-64, independent of whether AMD buys Xilinx or not, and of how well the integration goes. The logic behind the AMD/Xilinx tie-up escapes me. If you have an idea why, send me an email at info at abacus-semi.com.

    Oracle kills off SPARC and Solaris

    Posted on by: Axel Kloth

    Oracle bought Sun Microsystems a long time ago. Considering Oracle and its past behavior, I am amazed that it did not kill off Solaris and SPARC (a RISC processor architecture) right after the acquisition. SPARC is an acronym; it stands for Scalable Processor ARChitecture. The problem is that it is not scalable and never was. Anyone trying to scale out a large number of SPARC cores will have run into that problem. Niagara could not fix that. As a result, I am not surprised that SPARC finally hit the end of the road, and if my contacts at Fujitsu are correct, they have given up as well.

    So I think it is fair to say goodbye to SPARC. Rest in peace, and please don't come back.

    Solaris, on the other hand, was a distinguishing feature. It was not better than other UNIX variants, but it was proprietary. I had assumed that Larry Ellison preferred proprietary and captive products as they prevent a customer from migrating to better solutions. Well, I was wrong, and Oracle of late has embraced Linux. I am not sure if that was due to the lackluster performance Oracle got out of scaled-out SPARC-based data centers running Solaris, or if Oracle simply got sick of supporting Solaris when in reality most Linux installations performed at least as well. Since Solaris did not have any OS extensions that would have made it a natural choice for any of the other Oracle offerings, it simply became a burden instead of an asset. Solaris, may you rest in peace, and please don't come back either.

    Polling versus Interrupts

    Posted on by: Axel Kloth

    During the design of every embedded processor, and even of quite a few accelerators, an age-old debate is revived. It is about real-time operating systems (RTOSes) versus multi-user, multi-tasking systems, and it is about polling versus interrupts. Those concepts may sound strange and abstract, but they are actually very simple to understand. Let me use a house with doors and windows as an analogy, and let us compare the case in which doorbells exist against the case in which doorbells have not yet been invented.

    Let's start with a system that uses polling. The analogy would be me walking around the house and "polling" (i.e. checking the status of) every door and every window of the house, at preset times. If I poll the doors, I need to walk around in a circle that covers my desk and all doors. That is a piece of linear software that loops back to where I started. I could therefore write it as a linear piece of code or as a loop in a "for all doors do the following:" scheme - the compiled code will be the same or nearly identical. It does not matter how I do it; it is fairly low effort to write that software. The problem is that I cannot prioritize doors, and while I am in the loop executing the check, someone may arrive at a door I have just visited, and I will quite likely miss that visitor. Another danger is that someone chats me up with seemingly important stuff, and I miss my schedule (or the entire cycle).
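
    To make the analogy concrete, here is a minimal sketch in C of what such a polling loop typically looks like; the door names and the memory-mapped status registers are hypothetical.

        /* Minimal polling sketch: check every "door" at a fixed cadence, in a fixed order. */
        #include <stdint.h>

        #define NUM_DOORS 4

        extern volatile uint32_t door_status[NUM_DOORS];  /* hypothetical status registers */

        static void handle_visitor(int door) { /* deal with whoever is at 'door' */ }
        static void do_desk_work(void)       { /* the task I actually want to do */ }

        void main_loop(void) {
            for (;;) {
                do_desk_work();                       /* if this runs long, I miss my schedule */
                for (int d = 0; d < NUM_DOORS; d++) { /* no priorities - strict round-robin    */
                    if (door_status[d] != 0) {
                        handle_visitor(d);
                        door_status[d] = 0;           /* mark the door as dealt with           */
                    }
                }                                     /* a visitor arriving at a door just after
                                                         I passed it waits a full cycle        */
            }
        }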

    In an interrupt-driven system I can prioritize doors (for example, the doorbell for emergency personnel has a different pitch and cadence than the one for the mailman bringing invoices and checks, and that again differs from the one for the least favorite salesman from the least favorite company we buy products and services from). That way I can deal with visitors at multiple doors easily. Let's say I am sitting at my desk and the invoice-delivering mailman rings the bell. I need to write down what I am doing so I can resume that task once I am back. I then proceed to the mailman to greet him and fetch my invoices for the day. On the way to the door, the emergency doorbell rings, and so I take note that I was on the way to the invoice-delivery door, proceed to the emergency door, and let the emergency personnel in while disabling all lower-priority interrupts, or have them queue up in the interrupt request queues. If my presence is required, I stay with the emergency personnel for as long as I am needed, and if I am not needed, the emergency door alarm is marked as being dealt with, the lower-priority interrupt requests are re-enabled, and I can resume my prior task once all other pending doorbells are dealt with. Since the invoice-delivery doorbell handling task is still pending completion, I proceed to the mailman, fetch my invoices, mark that as finished, and resume my desk task. An additional benefit arises when I can mask the interrupts I do not want to deal with at a given point in time. If I am busy finishing some important work, I can either disable certain doorbells, or I can let their requests become pending but not allow them to interrupt me. In other words, the annoying salesman from company C will ring the doorbell and hear that it rang, but since I have some important stuff to do for the next 30 minutes, that request will stay pending. He can choose to stay for the 30 minutes, or post a note and leave.

    By using an interrupt-driven system, I am in control of when I can and want to be interrupted and which interrupt requests I take, and I can guarantee that high-priority requests are always dealt with first, even if multiple interrupt requests arrive at the same time or in a time-staggered fashion. Even if more interrupt requests arrive at once than I have time available to process, the system does not go into overload: I simply mask all lower-priority interrupt sources and either leave their requests pending or disable them entirely. Hence, it is impossible to allocate more of my time to checking the doors than I actually have available.
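
    As a counterpart to the polling sketch above, here is what the interrupt-driven version of the same house might look like. The interrupt-controller calls (irq_mask_below, irq_unmask_all) are hypothetical stand-ins for whatever APIC or interrupt controller the target platform actually provides.

        /* Minimal interrupt-driven sketch: each doorbell has its own handler and priority. */
        #include <stdint.h>

        enum { PRIO_EMERGENCY = 0, PRIO_MAIL = 1, PRIO_SALESMAN = 2 };  /* 0 = highest */

        extern void irq_mask_below(int prio);  /* hypothetical: mask all lower priorities */
        extern void irq_unmask_all(void);      /* hypothetical: re-enable masked sources  */

        volatile uint32_t pending_mail = 0;

        void emergency_isr(void) {            /* the ISR prologue is the "note on the desk" */
            irq_mask_below(PRIO_EMERGENCY);   /* nothing may interrupt this                 */
            /* ... let the emergency personnel in ... */
            irq_unmask_all();                 /* lower-priority requests become live again  */
        }

        void mail_isr(void) {
            pending_mail++;                   /* note the request; fetch the invoices later */
        }

        void salesman_isr(void) {
            /* lowest priority: stays pending while I finish my 30 minutes of real work */
        }

        void desk_task(void) {
            for (;;) {
                /* important desk work runs here and is only interrupted when I allow it */
                if (pending_mail) { /* ... fetch invoices ... */ pending_mail--; }
            }
        }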

    Let's evaluate what the overhead is: I need what is called an advanced programmable interrupt controller (APIC), and upon each incoming interrupt request I need to write down what I was doing when I was interrupted; once the request is dealt with, I resume that task and shred the note I took to remind myself of what I had been doing. That is all. The APIC is a fairly trivial piece of hardware these days, and the overhead of noting what I was in the process of doing is no different from a branch to and return from any other subroutine, other than that it is privileged. In computer speak, any CPU core can do it. In fact, all of it can be done by an I/O coprocessor within a heterogeneous MP system.

    Performance and I/O Bandwidth revisited

    Posted on by: Axel Kloth

    I would like to come back to an observation which stated that for a processor to achieve 1 FLOPS of floating-point processing performance, it would need to provide 1 byte/s of memory bandwidth. While Gene Amdahl disclaimed this particular observation, I think it is important to point out that there is a logical relationship between the performance of a processor and its I/O bandwidth - this would be Kloth's Second Observation if unclaimed or disclaimed by Amdahl or his collaborators. One FLOPS is defined as 1 floating-point operation per second. Today, we assume that all floating-point operations are on 32-bit floats (minimum requirement) or 64-bit floats (more mainstream today). Most floating-point operations require two operands and produce one result, and the instruction is issued by one instruction word of 16-bit, 32-bit or 64-bit length. Most superscalar processors will need 64-bit instruction words to issue all instructions that are supposed to be executed in the following cycle. In our example, we will see two 64-bit operands being transferred in, one instruction word (64 bit) being transferred in, and one 64-bit result being transferred out. If any instruction is supposed to take one cycle (in superscalar architectures, more than one instruction per cycle is normal), then any FPU will create one result per instruction and thus per cycle. In this example, we will need three 64-bit words in and one 64-bit word out per cycle. That is four quad words per cycle, or 32 bytes of I/O per cycle. That's 32 bytes of I/O per FLOP, or, if we divide all units by the time unit of 1 s, 32 bytes/s of I/O per FLOPS. This situation is aggravated in any processor that contains multiple superscalar cores but provides a very low number of DRAM interfaces. Only in situations in which compound and very complex floating-point instructions are executed would this not apply; I can think of sincos and tan, the hyperbolic functions, and the log and exp functions. What implications does that have? You won't see 3 TFLOPS of floating-point performance in a Xeon with a mere 200 GB/s of memory bandwidth. Our numbers are not theoretical peak performance numbers - they are numbers that indicate sustainable performance.
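
    The back-of-the-envelope arithmetic can be written down in a few lines. The sketch below assumes, as the argument above does, that every operand, instruction word and result actually crosses the memory interface, with no reuse out of registers or caches - which is exactly the worst case the observation describes.

        /* Bytes-per-FLOP arithmetic: 2 operands in + 1 instruction word in + 1 result out. */
        #include <stdio.h>

        int main(void) {
            const double bytes_per_flop = (2 + 1 + 1) * 8;  /* four 64-bit words = 32 bytes */
            const double target_flops   = 3e12;             /* 3 TFLOPS, as in the example  */
            const double available_bw   = 200e9;            /* 200 GB/s memory bandwidth    */

            printf("required bandwidth:   %.0f TB/s\n", target_flops * bytes_per_flop / 1e12);
            printf("available bandwidth:  %.1f TB/s\n", available_bw / 1e12);
            printf("sustainable this way: %.2f GFLOPS\n", available_bw / bytes_per_flop / 1e9);
            return 0;
        }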

    Most software will end up as a piece of hardware

    Posted on by: Axel Kloth

    First, I am glad the RISC-versus-CISC wars are over. Second, I am equally glad that the best of both worlds has come to dominate processor design. We now have processors that support large instruction sets (and, not surprisingly, the largest instruction set can be found in a RISC processor), and some of those instructions point to hardware engines that execute certain functions as native and possibly atomic instructions. Those functions are complex mathematical operations that were unthinkable to implement in hardware even just a decade ago. FFTs, matrix multiplications and most video codecs are commonplace functions in processors and DSPs today. Just a few years ago they were macros or function calls comprised of hundreds of simple native processor instructions. So my lemma and theory is simply that every software library function that has proven to be useful will end up as a piece of hardware. In essence, it also means that the ISA will become less important. Our processor cores are based on the open Instruction Set Architecture (ISA) RISC-V. Our RISC-V cores are modified in a lot of ways from the original UC Berkeley and SiFive RISC-V designs. While they are fully compliant with the RISC-V ISA, we focused on simplicity, scalability and performance instead of ease of design. We have also modified the "uncore" parts of the processors while retaining full compatibility so we don't have to rewrite or modify drivers. Unlike other approaches, we decided not to extend the ISA either. However, we could have designed our processors with a different core and a different ISA, and I believe it would have had no meaningful performance impact.

    As a result, users and programmers can rely on industry-standard compilers, SDKs and APIs and do not have to worry about instruction sets.

    Hardware versus software

    Posted on by: Axel Kloth

    One of my favorite pieces of advice from Albert Einstein is to "make everything as simple as possible, but not any simpler than that". Software is not simple. Software is great for testing out new algorithms, or new ideas that cannot yet be expressed algorithmically. However, once performance, robustness and security metrics and power envelopes come into the game, software is at a disadvantage against hardware. The world's programmers write more and more software; there cannot be any doubt about it. However, that software runs on more differentiated and dedicated hardware than ever. We not only have CPUs and GPUs (rebranded to GPGPUs), FPGAs, TPUs and security processors, there are also more and more offload engines for network traffic. In other words, functions that we used to execute in software are now available as hardware building blocks, invoked by simple driver commands.

    Therefore my lemma is: "Any problem that can be described algorithmically, through table lookups or a combination thereof will be implemented as a piece of software unless it can be executed more efficiently in dedicated hardware". That statement also leads directly to Kloth's First Observation. If more and more software is being written, then (atomic) instructions will have to cover more and more of the tasks executed, to save power and energy, reduce latency, and improve performance. MMX and SSE as well as H.264 video codecs are only one set of examples. Kloth's First Observation can be summarized as follows: "Any sequence of instructions that is proven to be deployed to a degree greater than a preset threshold in a typical application mix is replaced by dedicated hardware, controlled by a Finite State Machine dedicated to this sequence of instructions, and invoked by a new instruction word. The threshold is determined by goals for performance and latency as well as by the need to reduce power and energy consumption".

    If George Haber's original Observation ("If it can be done in software it will") were true, then we'd be back to the original Turing Machine, where no hardware assist existed for anything. That would mean that our processors execute in, out, inc and bne (Branch on Not Equal). That is all that is required, but of course it is not practical. While we could build a processor like this today in very few gates, and with today's SiGe or GaAs process technologies probably in the range of 60 - 100 GHz, it would not be practical, it would not perform well, and its power consumption per (compound) instruction executed would be well above the threshold that anyone would accept today. Instead, we keep adding instructions and dedicated hardware to the instruction set of a processor, and sometimes we even need an entirely new class of coprocessors for specific tasks, because the memory bandwidth of a CPU might not be compatible with the needs of that coprocessor, and because there may be real-time requirements that, for example, a DSP can fulfill and a CPU cannot. So while the breadth of software is increasing, so is the world of hardware accelerators.
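
    A small and well-known example of this observation at work is the fused multiply-add: a multiply followed by an add occurs often enough in real workloads that it became a single instruction backed by dedicated hardware, exposed in C through the standard fma() library call. The sketch below contrasts the two-instruction sequence with the fused instruction; compile with -ffp-contract=off (gcc/clang) so the compiler does not fuse the first version on its own.

        /* The two-operation version rounds the intermediate product; the fused
         * instruction rounds only once and recovers the tiny true result.     */
        #include <math.h>
        #include <stdio.h>

        int main(void) {
            double a = 1.0 + 0x1p-27;           /* exactly representable         */
            double b = 1.0 - 0x1p-27;

            double two_ops = a * b - 1.0;       /* a*b rounds to 1.0, result 0.0 */
            double fused   = fma(a, b, -1.0);   /* single rounding: -2^-54       */

            printf("a*b - 1.0     = %.17g\n", two_ops);
            printf("fma(a, b, -1) = %.17g\n", fused);
            return 0;
        }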

    Recent events at Abacus Semiconductor

    Posted on by: Axel Kloth

    I am happy to announce that Abacus Semiconductor Corporation has been formally established. We have identified team members we'd like to bring on board, and we will finish that process in the next few weeks. We are going to build out the team to the degree necessary and then announce it by the end of June of 2021.
