numpy - Similarity measure/matrix for data (recommender system) - Python
I am new to machine learning and am trying out the following problem. The input is 2 arrays of descriptions of the same length, and the output is an array of similarity scores: the first string in the first array compared with the first string in the second array, and so on.
Each item in the array (a numpy array) is a string of a description. I can write a function to find out how similar 2 strings are by calculating how many identical and co-occurring word ids there are, and assign a score (one possible weight could be based on the frequency of co-occurrence vs. the sum of the frequencies of the individual word ids). I would then apply the function to the 2 arrays to get an array of scores. Please let me know if there are other approaches I might want to consider as well. Thanks!
Data:
array(['0/1/2/3/4/5/6/5/7/8/9/3/10', '11/12/13/14/15/15/16/17/12', '18/19/20/21/22/23/24/25', '26/27/28/29/30/31/32/33/34/35/36/37/38/39/33/34/40/41', '5/42/43/15/44/45/46/47/48/26/49/50/51/52/49/53/54/51/55/56/22', '57/58/59/60/61/49/62/23/57/58/63/57/58', '64/65/66/63/67/68/69', '70/71/72/73/74/75/76/77', '78/79/80/81/82/83/84/85/86/87/88/89/90/91', '33/34/92/93/94/95/85/96/97/98/99/60/85/86/100/101/102/103', '104/105/106/107/108/109/110/86/107/111/112/113/5/114/110/113/115/116', '117/118/119/120/121/12/122/123/124/125', '14/126/127/128/122/129/130/131/132/29/54/29/129/49/3/133/134/135/136', '137/138/139/140/141/142', '143/144/145/146/147/148/149/150/151/152/4/153/154/155/156/157/158/128/159', '160/161/162/163/131/2/164/165/166/167/168/169/49/170/109/171', '172/173/174/175/176/177/73/178/104/179/180/179/181/173', '182/144/183/179/73', '184/163/68/185/163/8/186/187/188/54/189/190/191', '181/192/0/1/193/194/22/195', '113/196/197/198/68/199/68/200/201/202/203/201', '204/205/206/207/208/209/68/200', '163/210/211/122/212/213/214/215/216/217/100/101/160/139/218/76/179/219', '220/221/222/223/5/68/224/225/54/225/226/227/5/221/222/223', '214/228/5/6/5/215/228/228/229', '230/231/232/233/122/215/128/214/128/234/234', '235/236/191/237/92/93/238/239', '13/14/44/44/240/241/242/49/54/243/244/245/55/56', '220/21/246/38/247/201/248/73/160/249/250/203/201', '214/49/251/252/253/254/255/256/257/258'], dtype='|S127')
array(['151/308/309/310/311/215/312/160/313/214/49/12', '314/315/316/275/317/42/318/319/320/212/49/170/179/29/54/29/321/322/323', '324/325/62/220/326/194/327/328/218/76/241/329', '330/29/22/103/331/314/68/80/49', '78/332/85/96/97/227/333/4/334/188', '57/335/336/34/187/337/21/338/212/213/339/340', '341/342/167/343/8/254/154/61/344', '2/292/345/346/42/347/348/348/100/349/202/161/263', '283/39/312/350/26/351', '352/353/33/34/144/218/73/354/355', '137/356/357/358/357/359/22/73/170/87/88/78/123/360/361/53/362', '23/363/10/364/289/68/123/354/355', '188/28/365/149/366/98/367/368/369/370/371/372/368', '373/155/33/34/374/25/113/73', '104/375/81/82/168/169/81/82/18/19', '179/376/377/378/179/87/88/379/20', '380/85/381/333/382/215/128/383/384', '385/129/386/387/388', '389/280/26/27/390/391/302/392/393/165/394/254/302/214/217/395/396', '397/398/291/140/399/211/158/27/400', '401/402/92/93/68/80', '77/129/183/265/403/404/405/406/60/407/162/408/409/410/411/412/413/156', '129/295/90/259/38/39/119/414/415/416/14/318/417/418', '419/420/421/422/423/23/424/241/421/425/58', '426/244/427/5/428/49/76/429/430/431', '257/432/433/167/100/101/434/435/436', '437/167/438/344/356/170', '439/440/441/442/192/443/68/80/444/445/111', '446/312/23/447/448', '385/129/218/449/450/451/22/452/125/129/453/212/128/454/455/456/457/377'], dtype='|S127')
The following code should do what you need in Python 3.x:
    import numpy as np
    from collections import Counter

    def jaccardsim(c1, c2):
        cu = c1 | c2  # multiset union: per-token maximum count
        ci = c1 & c2  # multiset intersection: per-token minimum count
        sim = sum(ci.values()) / sum(cu.values())
        return sim

    def bytearraysim(b1, b2):
        # decode each byte string and count its "/"-separated word ids
        ca = [Counter(b1[i].decode(encoding="utf-8", errors="strict").split("/")) for i in range(len(b1))]
        cb = [Counter(b2[i].decode(encoding="utf-8", errors="strict").split("/")) for i in range(len(b2))]
        # assuming both 'b1' and 'b2' are the same length
        csim = [jaccardsim(x, y) for x, y in zip(ca, cb)]
        return csim  # array of similarities
The Jaccard similarity score is used in this implementation. You may use other scores, such as cosine or Hamming, to your liking.
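To make the multiset arithmetic behind this score concrete, here is a small sketch that works through one pair by hand. It uses the fourth string of each array from the question (index 3); the variable names `s1`, `s2`, `common`, `either`, `shared`, and `total` are only illustrative.

```python
from collections import Counter

# Fourth pair from the question's data (index 3 in each array).
s1 = "26/27/28/29/30/31/32/33/34/35/36/37/38/39/33/34/40/41"
s2 = "330/29/22/103/331/314/68/80/49"

c1 = Counter(s1.split("/"))
c2 = Counter(s2.split("/"))

common = c1 & c2  # keeps each token with the smaller of its two counts
either = c1 | c2  # keeps each token with the larger of its two counts

shared = sum(common.values())  # number of shared token occurrences
total = sum(either.values())   # size of the multiset union
score = shared / total

print(dict(common), shared, total, score)
```

Only the word id 29 appears in both strings, so the score is 1 divided by the 26 token occurrences in the union. Because counts are kept, a repeated id such as 33 or 34 contributes twice to the denominator.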
Assuming the arrays are stored in variables a and b, the function call bytearraysim(a, b) outputs the following similarity scores:
[0.0, 0.0, 0.0, 0.038461538461538464, 0.0, 0.041666666666666664, 0.0, 0.0, 0.0, 0.08, 0.0, 0.05555555555555555, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.058823529411764705, 0.0, 0.0, 0.0, 0.05555555555555555, 0.0, 0.0, 0.0, 0.0, 0.0]
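As one of the alternative scores mentioned above, cosine similarity over the same word-id counts could be sketched roughly as follows. This is not part of the original answer: the helper names `tokens`, `cosine_sim`, and `pairwise_cosine` are hypothetical, and the sketch assumes each item is either a byte string or a regular string of "/"-separated ids.

```python
import math
from collections import Counter

def tokens(item):
    """Count the '/'-separated word ids in one description (hypothetical helper)."""
    if isinstance(item, bytes):  # also true for numpy byte-string elements
        item = item.decode("utf-8")
    return Counter(item.split("/"))

def cosine_sim(c1, c2):
    """Cosine similarity between two count vectors stored as Counters."""
    # Only ids present in both Counters contribute to the dot product.
    dot = sum(c1[t] * c2[t] for t in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

def pairwise_cosine(a, b):
    """Score a[i] against b[i]; assumes both sequences have the same length."""
    return [cosine_sim(tokens(x), tokens(y)) for x, y in zip(a, b)]
```

Calling `pairwise_cosine(a, b)` on the two arrays would give one score per pair, in the same order as the Jaccard scores. Compared with the Jaccard variant, cosine gives more weight to ids that repeat in both descriptions, which may or may not suit a recommender use case.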