TPU processor, 16-channel HD video intelligent analysis, 16-channel full-HD video decoding, 10-channel full-HD video encoding
TPU processor, 32-channel HD video intelligent analysis, 32-channel full-HD video decoding, 12-channel full-HD video encoding
RISC-V + ARM intelligent deep learning processor
Based on RISC-V cores running at 2 GHz, the processor integrates 64 cores and a 64 MB shared L3 cache on a single SoC.
SRC1-10 is a high-performance server cluster based on the RISC-V architecture. It provides both computing and storage capabilities, and its full hardware and software stack is domestically produced.
The RISC-V fusion server supports dual-processor interconnection and enables intelligent computing acceleration.
SRB1-20 is a high-performance storage server based on the RISC-V architecture. It supports CCIX and 128-core concurrency, provides multi-disk large-capacity secure storage, and its full hardware and software stack is domestically produced.
SRA1-20 is a high-performance computing server based on the RISC-V architecture. It supports CCIX and 128-core concurrency, and both its software and hardware are open source and controllable.
SRA3-40 is a RISC-V server for high-performance computing, built around a domestically produced main processor. It delivers excellent performance, integrates intelligent computing, and supports powerful video encoding and decoding.
SRB3-40 is a high-performance RISC-V storage server with multiple disk slots and large-capacity secure storage.
Intelligent computing server SGM7-40, adapted to mainstream LLMs; a single card can run a 70B large language model
SOM1684, BM1684, 16-Channel HD Video Analysis
Core-1684-JD4, BM1684, 16-Channel HD Video Analysis
SBC-6841, BM1684, 16-Channel HD Video Analysis
iCore-1684XQ, BM1684X, 32-Channel HD Video Analysis
Core-1684XJD4, BM1684X, 32-Channel HD Video Analysis
Shaolin PI SLKY01, BM1684, 16-Channel HD Video Analysis
QY-AIM16T-M, BM1684, 16-Channel HD Video Analysis
QY-AIM16T-M-G, BM1684, 16-Channel HD Video Analysis
QY-AIM16T-W, BM1684, 16-Channel HD Video Analysis
AIV02T, BM1684*2, Half-Height Half-Length Accelerator Card
AIO-1684JD4, BM1684, 16-Channel HD Video Analysis
AIO-1684XJD4, BM1684X, 32-Channel HD Video Analysis
AIO-1684XQ, BM1684X, 32-Channel HD Video Analysis
IVP03X, BM1684X, 32-Channel HD Video Analysis
IVP03A, Microserver, Passive Cooling, 12GB RAM
Coeus-3550T, BM1684, 16-Channel HD Video Analysis
EC-1684JD4, BM1684, 16-Channel HD Video Analysis
CSA1-N8S1684, BM1684*8, 1U Cluster Server
DZFT-ZDFX, BM1684X, Electronic Seal Analyzer, ARM+DSP Architecture
ZNFX-32, BM1684, 16-Channel HD Video Analysis
ZNFX-8, BM1684X, ARM+DSP Architecture, Flameproof and Intrinsic Safety Analysis Device
EC-A1684JD4, BM1684, Microserver with Active Cooling, 16GB RAM, 32GB eMMC
EC-A1684JD4 FD, BM1684, 16-Channel HD Video Analysis, 6GB RAM, 32GB eMMC
EC-A1684XJD4 FD, BM1684X, 32-Channel HD Video Analysis
ECE-S01, BM1684, 16-Channel HD Video Analysis
IOEHM-AIRC01, BM1684, Microserver with Active Cooling, 16-Channel HD Video Analysis
IOEHM-VCAE01, BM1684, 16-Channel HD Video Analysis
CSA1-N8S1684X, BM1684X*8, 1U Cluster Server
QY-S1U-16, BM1684, 1U Server
QY-S1U-192, BM1684*12, 1U Cluster Server
QY-S1X-384, BM1684X*12, 1U Cluster Server
Deep learning intelligent analysis helps make city management more efficient and precise
Using deep learning video technology to analyze sources of dust generation and dust events, contributing to ecological environmental protection
Using deep learning intelligent analysis to monitor scenarios such as safety production, urban firefighting, and unexpected incidents for emergency regulation.
Using deep learning technology to detect and analyze individuals, vehicles, and security incidents in grassroots governance
Addressing traffic congestion, driving safety, vehicle violations, and road pollution control
Utilizing domestically developed computational power to support the structured analysis of massive volumes of videos, catering to practical applications in law enforcement
Build a "smart, collaborative, efficient, innovative" gait recognition big data analysis system centered around data
Effectively resolving incidents of objects thrown from height, achieving real-time monitoring of such incidents, pinpointing the location of the thrown object, triggering alerts, and effectively safeguarding the safety of the public from falling objects
Using edge computing architecture to timely and accurately monitor community emergencies and safety hazards
SOPHGO works with SOPHON.TEAM ecosystem partners to build a deep learning supervision solution for smart hospitals, enhancing hospital safety management efficiency
SOPHGO works with SOPHON.TEAM ecosystem partners to build a smart safe campus solution
Using a combination of cloud-edge deep learning methods to address food safety supervision requirements across multiple restaurant establishments, creating a closed-loop supervision system for government and enterprise-level stakeholders
SOPHON's self-developed computing hardware, such as the SG6/SE5/SE6, equipped with SOPHON.TEAM video analysis algorithms, makes industrial production safety smarter
Combining deep learning, edge computing and other technologies, it has the ability to intelligently identify people, objects, things and their specific behaviors in the refueling area and unloading area. It also automatically detects and captures illegal incidents at gas stations to facilitate effective traceability afterwards and provide data for safety management.
SOPHGO, in collaboration with SOPHON.TEAM and its ecosystem partners, is focusing on three major scene requirements: "Production Safety Supervision," "Comprehensive Park Management," and "Personnel Safety & Behavioral Standard Supervision." Together, they are developing a comprehensive deep learning scenario solution, integrating "algorithm + computing power + platform."
SOPHGO cooperates with SOPHON.TEAM ecosystem partners to build a deep learning monitoring solution for safety risks in chemical industry parks
SOPHGO works with SOPHON.TEAM ecosystem partners to build a Smart Computing Center solution, establishing a cloud-edge collaborative smart computing center with unified management and scheduling
SOPHGO, in collaboration with the SOPHON.TEAM ecosystem, has jointly developed hardware leveraging domestically produced deep learning compute products. Built on an AutoML zero-code automated deep learning training platform, it enables rapid and efficient delivery of deep learning engineering solutions
typedef struct {
int left_rows, left_cols, right_cols; // rows of left matrix, cols of left matrix, cols of right matrix
unsigned long long output_addr; // address of the output matrix
unsigned long long left_addr; // address of the left matrix
unsigned long long right_addr; // address of the right matrix
} __attribute__((packed)) param_t;
param_t params[] = {
{.left_rows = 2, .left_cols = 100352, .right_cols = 2048 }, // 0
{.left_rows = 2, .left_cols = 1280, .right_cols = 1000 }, // 1
{.left_rows = 2, .left_cols = 25088, .right_cols = 4096 }, // 2
{.left_rows = 4, .left_cols = 1024, .right_cols = 25088}, // 3
{.left_rows = 32, .left_cols = 2048, .right_cols = 36 }, // 4
{.left_rows = 64, .left_cols = 9216, .right_cols = 4096 }, // 5
{.left_rows = 79, .left_cols = 256, .right_cols = 4090 }, // 6
{.left_rows = 200, .left_cols = 4096, .right_cols = 324 }, // 7
{.left_rows = 256, .left_cols = 768, .right_cols = 3072 }, // 8
{.left_rows = 256, .left_cols = 3072, .right_cols = 768 }, // 9
{.left_rows = 300, .left_cols = 2048, .right_cols = 80 }, // 10
{.left_rows = 1024, .left_cols = 1024, .right_cols = 1024 }, // 11
{.left_rows = 2048, .left_cols = 4, .right_cols = 1024 }, // 12
{.left_rows = 12544, .left_cols = 2, .right_cols = 1024 }, // 13
{.left_rows = 100352, .left_cols = 1024, .right_cols = 1 }, // 14
};
The way matrices are laid out in local memory differs slightly from the way tensors are laid out; see the SOPHGO documentation for details. When programming, take care to distribute the data across the TPU sensibly so that as much of its compute power as possible is used.
If the left matrix, right matrix, and result matrix all fit into local memory at once, the job is straightforward: copy in the left and right matrices -> compute the result -> copy out the result matrix. Pseudocode:
local_addr_t left_addr, right_addr, output_addr;
param_t* param;
matrix_S2L(left_addr, param->left_addr);
matrix_S2L(right_addr, param->right_addr);
okk_bdc_matmul(output_addr, left_addr, right_addr);
matrix_L2S(param->output_addr, output_addr);
Most of the provided test cases cannot be stored directly in local memory, so blocked (tiled) matrix multiplication is needed to split the matrices up.
Three splitting strategies are described here.
As illustrated, when the left matrix is large and the right matrix is small, consider splitting along the rows of the left matrix, as in test case 14. In that situation the right matrix can stay resident while we loop, copying in the part of the left matrix to be computed and copying out the corresponding part of the output matrix.
local_addr_t left_addr, right_addr, output_addr;
param_t* param;
matrix_S2L(right_addr, param->right_addr);
for (int i = 0; i < blocks; i++)
{
matrix_S2L(left_addr, param->left_addr + left_skip_bytes);
okk_bdc_matmul(output_addr, left_addr, right_addr);
matrix_L2S(param->output_addr + output_skip_bytes, output_addr);
}
Note the last block of the left matrix: since its row count is not always divisible by the number of blocks, the last block may be a different size from the others.
Similarly, when the right matrix is large and the left matrix is small, consider splitting along the columns of the right matrix, as in test case 4. We keep the left matrix fixed and loop, copying in blocks of the right matrix for computation and copying out the output matrix.
local_addr_t left_addr, right_addr, output_addr;
param_t* param;
matrix_S2L(left_addr, param->left_addr);
for (int i = 0; i < blocks; i++)
{
matrix_S2L(right_addr, param->right_addr + right_skip_bytes, right_cols_in_memory); // right_cols_in_memory gives the number of columns per row in memory, so the copy advances by a full row stride rather than only by the elements already copied
okk_bdc_matmul(output_addr, left_addr, right_addr);
matrix_L2S(param->output_addr + output_skip_bytes, output_addr, right_cols_in_memory);
}
Note that matrices are stored row-major in memory. When splitting by rows of the left matrix we can simply copy several whole rows, but when splitting by columns of the right matrix the data is non-contiguous, so when copying in we must specify the matrix's column count in memory in order to select only the current block's data.
There is one further situation, as in test case 0: the left matrix's columns and the right matrix's rows are both so large that neither matrix fits in the TPU on its own. Here we split along the columns of the left matrix (and the corresponding rows of the right matrix), accumulate the partial results, and finally copy out the output matrix.
local_addr_t left_addr, right_addr, output_addr;
param_t* param;
okk_bdc_set_C(output_addr, 0.0); // zero the output matrix to initialize the accumulation
for(int i=0; i<blocks; i++)
{
matrix_S2L(left_addr, param->left_addr + left_skip_bytes);
matrix_S2L(right_addr, param->right_addr + right_skip_bytes);
okk_bdc_matmul(output_addr, left_addr, right_addr, /* result_add = */ true); // result_add controls whether the result is accumulated into the output address
}
matrix_L2S(param->output_addr, output_addr);
The matmul part has no especially good special-case optimizations; the main lever is tuning parameters such as col_per_NPU so that as much compute power as possible is used.
With the methods above, the provided test cases can generally all be passed. For even larger matrices, consider splitting along both rows and columns.